The attached Perl script prints excerpts from all lines in a plain-text
file that contain non-ASCII bytes. With option -m, it looks for malformed
and overlong UTF-8 sequences instead. Useful for manually reviewing files
of unknown encoding.
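
To give a rough idea of the output, here is a minimal sketch of the excerpting step with the terminal highlighting left out (roughly what -p does). The helper name excerpt and the sample line are made up for illustration and are not part of the attached script; the regular expression and the <xx> hex notation follow the script.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not in the attached script): return an excerpt
# around the first non-ASCII byte in $line, with up to $context bytes on
# either side, or nothing if the line is pure ASCII.
sub excerpt {
    my ($line, $context) = @_;
    return unless $line =~ /(.{0,$context}[\x80-\xff].{0,$context})/;
    my $cut = $1;
    # Replace every non-ASCII byte by its two-digit hex value.
    $cut =~ s/([\x80-\xff])/sprintf("<%02x>", ord($1))/eg;
    return "...$cut...";
}

# "cafe au lait" with an ISO 8859-1 e-acute (byte 0xe9), 3 context bytes.
print excerpt("caf\xe9 au lait\n", 3), "\n";   # prints: ...caf<e9> au...
```

The script itself additionally prefixes each excerpt with the file name and line number.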

It is somewhat tedious to write down regular expressions that match
malformed or overlong UTF-8 sequences, and the ones in the script might
not be ultra-efficient, either.
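
As an illustration of the overlong case, the sketch below applies the overlong pattern from the attached script to a few byte strings. The helper name is_overlong and the test strings are assumptions added for this example only. The idea behind the ranges: a sequence is overlong when the same code point would fit into a shorter encoding, e.g. 0xe0 followed by a byte in 0x80-0x9f encodes a value below U+0800, which needs only two bytes.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Overlong pattern, copied from the attached script.
my $utf8overlong = '[\xc0-\xc1][\x80-\xbf]|'.
                   '\xe0[\x80-\x9f][\x80-\xbf]|'.
                   '\xf0[\x80-\x8f][\x80-\xbf]{2}|'.
                   '\xf8[\x80-\x87][\x80-\xbf]{3}|'.
                   '\xfc[\x80-\x83][\x80-\xbf]{4}';

# Hypothetical helper (not in the attached script): 1 if the byte string
# contains an overlong UTF-8 sequence, 0 otherwise.
sub is_overlong {
    my ($bytes) = @_;
    return $bytes =~ /$utf8overlong/ ? 1 : 0;
}

print is_overlong("\xc0\xaf"), "\n";      # 1: overlong 2-byte form of '/' (U+002F)
print is_overlong("\xc3\xa4"), "\n";      # 0: shortest form of U+00E4
print is_overlong("\xe0\x80\xaf"), "\n";  # 1: overlong 3-byte form of '/'
print is_overlong("\xe2\x82\xac"), "\n";  # 0: shortest form of U+20AC
```

Note that this only checks for overlong forms; truncated or otherwise malformed sequences are the job of the separate malformed pattern.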

I hope that's of use to someone out there ...

Markus

-- 
Markus Kuhn, Computer Lab, Univ of Cambridge, GB
http://www.cl.cam.ac.uk/~mgk25/ | __oo_O..O_oo__

#!/usr/bin/perl
# This tool shows extracts from all lines that contain non-ASCII bytes.
# It can also look for malformed UTF-8 sequences (option -m) instead.
# Markus Kuhn -- http://www.cl.cam.ac.uk/~mgk25/ -- 2003-03-15
$nonascii      = '[\x80-\xff]';
$utf8malformed = '[\x00-\x7f][\x80-\xbf]+|^[\x80-\xbf]+|'.
                 '[\xc0-\xdf][\x00-\x7f\xc0-\xff]|'.
                 '[\xc0-\xdf][\x80-\xbf]{2}|'.
                 '[\xe0-\xef][\x80-\xbf]{0,1}[\x00-\x7f\xc0-\xff]|'.
                 '[\xe0-\xef][\x80-\xbf]{3}|'.
                 '[\xf0-\xf7][\x80-\xbf]{0,2}[\x00-\x7f\xc0-\xff]|'.
                 '[\xf0-\xf7][\x80-\xbf]{4}|'.
                 '[\xf8-\xfb][\x80-\xbf]{0,3}[\x00-\x7f\xc0-\xff]|'.
                 '[\xf8-\xfb][\x80-\xbf]{5}|'.
                 '[\xfc-\xfd][\x80-\xbf]{0,4}[\x00-\x7f\xc0-\xff]|'.
                 '[\xfc-\xfd][\x80-\xbf]{6}|'.
                 '\xfe|\xff';
$utf8overlong  = '[\xc0-\xc1][\x80-\xbf]|'.
                 '\xe0[\x80-\x9f][\x80-\xbf]|'.
                 '\xf0[\x80-\x8f][\x80-\xbf]{2}|'.
                 '\xf8[\x80-\x87][\x80-\xbf]{3}|'.
                 '\xfc[\x80-\x83][\x80-\xbf]{4}';

$match = $nonascii;
$context = 10;
$bhex = "\e[7m";
$ehex = "\e[m";

while ($ARGV[0] =~ /^-/) {
    $_ = shift @ARGV;
    if (/^-r$/) {
        $raw = 1;
    } elsif (/^-c(\d+)$/) {
        $context = $1;
    } elsif (/^-p$/) {
        $bhex = $ehex = '';
    } elsif (/^-m$/) {
        $match = "$utf8malformed|$utf8overlong";
    } else {
        print <<EOT;
Usage: [options] files ...

Options:

    -c<int> list <int> context bytes
    -r      pass raw non-ASCII bytes through
    -p      do not add terminal control sequences for highlighting
    -m      match malformed UTF-8 sequences instead of non-ASCII bytes
    -h      list command line options

EOT
        exit 1;
    }
}

while (<>) {
    if (/.{0,$context}($match).{0,$context}/) {
        $_ = $&;
        s/([\x80-\xff])/sprintf("$bhex<%02x>$ehex",ord($1))/eg unless $raw;
        print "$ARGV:$.: ...$_...\n";
    }
    close(ARGV) if (eof);
}
