The attached Perl script prints extracts from all lines in a plain-text file that contain non-ASCII bytes. With option -m, it looks for malformed and overlong UTF-8 sequences instead. Useful for manually reviewing files of unknown encoding.
It is somewhat tedious to write down regular expressions that match malformed or overlong UTF-8 sequences. They might not be ultra-efficient, either. I hope that's of use to someone out there ...

Markus

--
Markus Kuhn, Computer Lab, Univ of Cambridge, GB
http://www.cl.cam.ac.uk/~mgk25/ | __oo_O..O_oo__

#!/usr/bin/perl
# This tool shows extracts from all lines that contain non-ASCII bytes.
# It can also look for malformed UTF-8 sequences (option -m) instead.
# Markus Kuhn -- http://www.cl.cam.ac.uk/~mgk25/ -- 2003-03-15

$nonascii = '[\x80-\xff]';
$utf8malformed = '[\x00-\x7f][\x80-\xbf]+|^[\x80-\xbf]+|'.
                 '[\xc0-\xdf][\x00-\x7f\xc0-\xff]|'.
                 '[\xc0-\xdf][\x80-\xbf]{2}|'.
                 '[\xe0-\xef][\x80-\xbf]{0,1}[\x00-\x7f\xc0-\xff]|'.
                 '[\xe0-\xef][\x80-\xbf]{3}|'.
                 '[\xf0-\xf7][\x80-\xbf]{0,2}[\x00-\x7f\xc0-\xff]|'.
                 '[\xf0-\xf7][\x80-\xbf]{4}|'.
                 '[\xf8-\xfb][\x80-\xbf]{0,3}[\x00-\x7f\xc0-\xff]|'.
                 '[\xf8-\xfb][\x80-\xbf]{5}|'.
                 '[\xfc-\xfd][\x80-\xbf]{0,4}[\x00-\x7f\xc0-\xff]|'.
                 '\xfe|\xff';
$utf8overlong = '[\xc0-\xc1][\x80-\xbf]|'.
                '\xe0[\x80-\x9f][\x80-\xbf]|'.
                '\xf0[\x80-\x8f][\x80-\xbf]{2}|'.
                '\xf8[\x80-\x87][\x80-\xbf]{3}|'.
                '\xfc[\x80-\x83][\x80-\xbf]{4}';

$match = $nonascii;
$context = 10;
$bhex = "\e[7m";
$ehex = "\e[m";

while ($ARGV[0] =~ /^-/) {
    $_ = shift @ARGV;
    if (/^-r$/) {
        $raw = 1;
    } elsif (/^-c(\d+)$/) {
        $context = $1;
    } elsif (/^-p$/) {
        $bhex = $ehex = '';
    } elsif (/^-m$/) {
        $match = "$utf8malformed|$utf8overlong";
    } else {
        print <<EOT;
Usage: [options] files ...

Options:
  -c<int>   list <int> context bytes
  -r        pass raw non-ASCII bytes through
  -p        do not add terminal control sequences for highlighting
  -m        match malformed UTF-8 sequences instead of non-ASCII bytes
  -h        list command line options
EOT
        exit 1;
    }
}

while(<>) {
    if (/.{0,$context}($match).{0,$context}/) {
        $_ = $&;
        s/([\x80-\xff])/sprintf("$bhex<%02x>$ehex",ord($1))/eg unless $raw;
        print "$ARGV:$.: ...$_...\n";
    }
    close(ARGV) if (eof);
}
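
To get a feel for what the overlong pattern catches, here is a small sketch that runs it against a few hand-picked byte strings. The pattern is copied from the script; the helper is_overlong() and the sample list are only illustrative assumptions and are not part of the attached tool.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Overlong pattern, copied from the script.
my $utf8overlong = '[\xc0-\xc1][\x80-\xbf]|'.
                   '\xe0[\x80-\x9f][\x80-\xbf]|'.
                   '\xf0[\x80-\x8f][\x80-\xbf]{2}|'.
                   '\xf8[\x80-\x87][\x80-\xbf]{3}|'.
                   '\xfc[\x80-\x83][\x80-\xbf]{4}';

# Hypothetical helper (not in the original script):
# returns 1 if the byte string contains an overlong sequence, else 0.
sub is_overlong {
    my ($bytes) = @_;
    return $bytes =~ /$utf8overlong/ ? 1 : 0;
}

# Illustrative samples: the slash U+002F in its shortest form and in two
# overlong forms, plus two ordinary multi-byte characters.
my @samples = (
    [ "\x2f",         'slash, 1 byte (shortest form)' ],
    [ "\xc0\xaf",     'slash, 2-byte encoding' ],
    [ "\xe0\x80\xaf", 'slash, 3-byte encoding' ],
    [ "\xc3\xa4",     'a-umlaut U+00E4' ],
    [ "\xe2\x82\xac", 'euro sign U+20AC' ],
);

for my $s (@samples) {
    my ($bytes, $label) = @$s;
    my $hex = join ' ', map { sprintf '%02x', ord } split //, $bytes;
    printf "%-10s %-32s %s\n", $hex, $label,
           is_overlong($bytes) ? 'overlong' : 'ok';
}
```

Only the two- and three-byte encodings of the slash should be reported as overlong; the other three samples are already in their shortest form. Note that this checks the overlong pattern alone, whereas option -m of the script combines it with the malformed pattern.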
