Re: Perl script to hunt for malformed/overlong UTF-8 sequences

Jungshik Shin Tue, 18 Mar 2003 18:58:28 -0800

Markus Kuhn wrote:

The attached Perl script prints cuts from all lines in a plaintext file that contain non-ASCII bytes. With option -m, it looks for malformed and overlong UTF-8 sequences instead. Useful for manually reviewing files with unknown encoding.

It may be a good idea to filter out the 'UTF-8' representation of surrogate code points (0xD800 - 0xDFFF) as well. That is, the following can be added to $utf8malformed:

\xed[\xa0-\xbf][\x80-\xbf]
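
As a rough check of that pattern, here is a minimal standalone sketch. It does not touch the original script: the names $utf8surrogate and has_surrogate are made up for illustration, and how the pattern would be merged into $utf8malformed depends on how that variable is built in the script, which is not shown here. The sample bytes are the UTF-8-style encodings of the code points just inside and just outside the surrogate range.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical name, for illustration only; the original script's
# $utf8malformed is not reproduced here.
# Matches a 3-byte sequence that would decode to U+D800..U+DFFF.
my $utf8surrogate = qr/\xed[\xa0-\xbf][\x80-\xbf]/;

# Returns 1 if the byte string contains an encoded surrogate, else 0.
sub has_surrogate {
    my ($bytes) = @_;
    return $bytes =~ $utf8surrogate ? 1 : 0;
}

# Sample byte strings, written with \x escapes so they stay raw bytes.
my %samples = (
    "U+D7FF (valid)"     => "\xed\x9f\xbf",
    "U+D800 (surrogate)" => "\xed\xa0\x80",
    "U+DFFF (surrogate)" => "\xed\xbf\xbf",
    "U+E000 (valid)"     => "\xee\x80\x80",
);

for my $label (sort keys %samples) {
    printf "%-20s %s\n", $label,
        has_surrogate($samples{$label}) ? "match" : "no match";
}
```

The second byte range starts at \xa0 rather than \x80 because \xed followed by \x80-\x9f encodes U+D000..U+D7FF, which are ordinary code points and should not be flagged.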

Jungshik

--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
