Re: Perl script to hunt for malformed/overlong UTF-8 sequences

Larry Wall Sat, 15 Mar 2003 18:56:32 -0800

On Sat, Mar 15, 2003 at 08:32:24PM +0000, Markus Kuhn wrote:
: The attached Perl script prints cuts from all lines in a plaintext file
: that contain non-ASCII bytes. With option -m, it looks for malformed and
: overlong UTF-8 sequences instead. Useful for manually reviewing files
: with unknown encoding.


I haven't tried it, but from a cursory inspection I don't believe
it'll work under Perl 5.8.0 in any UTF-8 locale unless you throw one
of these in there at the top:

    use bytes;
    binmode(STDIN,":bytes");
    use open ':std', IO => ':bytes';

And if you also want it to work with ancient versions of Perl, your
best bet is something like:

    eval 'binmode(STDIN,":bytes"); binmode(STDOUT,":bytes")';
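
For illustration only, here is a minimal sketch of how that guard might sit
at the top of a byte-oriented scanner. It is not the attached script: the
helper name has_non_ascii and the line-number output format are assumptions
made up for this example, and it only flags non-ASCII bytes rather than
checking for malformed or overlong sequences.

```perl
#!/usr/bin/perl
use strict;

# String eval: on Perls that predate I/O layers, the two-argument binmode
# fails inside the eval instead of aborting the whole script.
eval 'binmode(STDIN, ":bytes"); binmode(STDOUT, ":bytes")';

# Hypothetical helper (not from the original script): returns 1 if the
# line contains any byte above 0x7F, else 0.
sub has_non_ascii {
    my ($line) = @_;
    return $line =~ /[\x80-\xFF]/ ? 1 : 0;
}

# Print each offending line prefixed with its input line number.
while (my $line = <STDIN>) {
    print "$.: $line" if has_non_ascii($line);
}
```

The point of forcing bytes is that the character class above only means
"any non-ASCII octet" when the input has not already been decoded as UTF-8.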

Sorry 'bout that.  I didn't expect RedHat 8.0 to turn on UTF-8 for you
by default, and I shouldn't have believed what I read on this mailing list
about the degree of commitment implied by UTF-8 locales...  :-)

Larry
--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
