Hello:
I need to automatically process the text of thousands of files with
simple regexp substitutions. The problem is that, although all the
files are plain text, they were written with a variety of programs on
Windows, so they use diverse encodings. For example, some are in
'utf-8', others in 'windows-1252', and some in 'latin-1'.
The script will run on Red Hat Linux, under Perl v5.8.5 (built for
x86_64-linux-thread-multi). I was in the process of whipping up a
script to run through these encodings (using Encode::decode) and find
the best one for each file, but I came up against an unfortunate
realization: a single two-byte Unicode character in UTF-8 looks
suspiciously like two single-byte ANSI (windows-1252) characters.
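For example, just to illustrate the ambiguity (the bytes below are
made up, not taken from my actual files):

    use Encode;

    # 0xC3 0xA9 is the UTF-8 encoding of the single character U+00E9
    # ("e" with acute accent), but the same two bytes are also perfectly
    # valid windows-1252, where they mean the two characters "Ã" and "©".
    my $bytes   = "\xC3\xA9";
    my $as_utf8 = Encode::decode('utf-8',        $bytes);  # 1 character
    my $as_1252 = Encode::decode('windows-1252', $bytes);  # 2 characters
    printf("utf-8: %d chars, windows-1252: %d chars\n",
           length($as_utf8), length($as_1252));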
This is what I have so far. It seems to work, but I'm not sure how
reliable it is:
### START OF CODE
open(TXT_FH, "< $insert_path") die("Unable to open file $insert_path:
$!");
# Attempt to read the file and decode the characters using
# various encoding schemes. This is to work around the
# mess of formats and characters in the insert files.
my @encodings = ('utf-8', 'windows-1252', 'iso-8859-1');
my $raw_data = '';
my $utf8_txt = '';
my $enc_idx = 0;
while( my $bytes = read(TXT_FH, my $buffer, 16) )
{
# buffer may end in a partial character so we append
$raw_data .= $buffer;
DECODE: while ($raw_data)
{
if ($enc_idx > $#encodings)
{
$utf8_txt .= Encode::decode('utf-8', $raw_data,
Encode::FB_PERLQQ);
$enc_idx = 0;
last DECODE;
}
# $data now contains the unprocessed partial character
$utf8_txt .= Encode::decode($encodings[$enc_idx], $raw_data,
Encode::FB_QUIET);
$enc_idx++ if ($raw_data);
}
}
close(TXT_FH);
### END OF CODE
Is there a better way to detect the encoding? I need to be sure I
interpret it correctly, because later on I have to generate XML files
in valid UTF-8.
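One alternative I have been toying with (just a sketch, and I may well
be missing something) is to slurp each file whole, try a strict UTF-8
decode first, and fall back to windows-1252 only if that croaks:

    use Encode;

    # Sketch: treat the whole file as UTF-8 if it decodes cleanly,
    # otherwise assume windows-1252.
    sub decode_file {
        my ($path) = @_;
        open(my $fh, "<", $path) or die("Unable to open file $path: $!");
        binmode($fh);
        my $bytes = do { local $/; <$fh> };
        close($fh);

        my $text = eval {
            my $copy = $bytes;   # FB_CROAK may modify the string it is given
            Encode::decode('utf-8', $copy, Encode::FB_CROAK);
        };
        return $text if defined $text;

        # Not valid UTF-8 as a whole; windows-1252 covers latin-1's
        # printable range plus the extra punctuation Windows editors use.
        return Encode::decode('windows-1252', $bytes);
    }

I also saw Encode::Guess in the Encode distribution, but I haven't
figured out whether it can resolve the utf-8 vs. windows-1252
ambiguity any better.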