Markus Kuhn wrote:

The attached Perl script prints cuts from all lines in a plaintext file
that contain non-ASCII bytes. With option -m, it looks for malformed and
overlong UTF-8 sequences instead. Useful for manually reviewing files
with unknown encoding.



It may be a good idea to filter out the 'UTF-8' representation of surrogate
code points (0xd800 - 0xdfff) as well. That is, the following can be added
to $utf8malformed:


\xed[\xa0-\xbf][\x80-\xbf]
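
For anyone who wants to try the pattern outside the Perl script, here is a
minimal sketch in Python, with the second byte's range written as \xa0-\xbf.
The names SURROGATE_UTF8 and has_encoded_surrogate are made up for this
illustration and are not part of Markus's script. It relies on the fact that
0xd800 would encode as ED A0 80 and 0xdfff as ED BF BF, while the valid code
point 0xd7ff encodes as ED 9F BF and so must not match.

```python
import re

# Hypothetical name: the byte pattern from this message as a Python bytes regex.
# It matches the three-byte UTF-8-style encoding of 0xd800..0xdfff.
SURROGATE_UTF8 = re.compile(rb"\xed[\xa0-\xbf][\x80-\xbf]")


def has_encoded_surrogate(line: bytes) -> bool:
    """Return True if the bytes contain an encoded surrogate code point."""
    return SURROGATE_UTF8.search(line) is not None


if __name__ == "__main__":
    # "surrogatepass" lets Python emit the (invalid) surrogate bytes for testing.
    bad = "\ud800".encode("utf-8", "surrogatepass")
    good = "\ud7ff".encode("utf-8")
    print(has_encoded_surrogate(bad))
    print(has_encoded_surrogate(good))
```

The pattern has to be applied to raw bytes, not decoded text, since a strict
UTF-8 decoder would already reject these sequences.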

Jungshik





--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/


