Hello:
I need to process the text of thousands of files automatically, with simple regexp substitutions. The problem I have is that, although all files are plaintext, they have been written with a variety of programs in Windows, so they employ diverse encodings. For example, some are in 'utf-8', others in 'windows-1252', and some in 'latin-1'.

I was in the process of whipping up a script to run through these encodings (using Encode::decode) to try to find the best one for each, but I came to an unfortunate realization: a single two-byte Unicode character in UTF-8 looks suspiciously like two single-byte ANSI (windows-1252) characters. The script will run on Red Hat Linux, running Perl v5.8.5 (built for x86_64-linux-thread-multi).
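
To illustrate the ambiguity, here is a minimal sketch (the variable names are only for illustration and are not taken from my script). It encodes one accented character as UTF-8 and then decodes the same two bytes both ways. Neither decode reports an error, which is why a successful windows-1252 decode by itself says very little:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# U+00E9 (e with acute accent) encoded as UTF-8 is the two bytes 0xC3 0xA9.
my $bytes = encode('UTF-8', "\x{e9}");

# Both decodes succeed; only the first gives back the intended character.
my $as_utf8   = decode('UTF-8', $bytes);         # one character:  U+00E9
my $as_cp1252 = decode('windows-1252', $bytes);  # two characters: U+00C3 U+00A9

# Print hex and lengths only, to keep the output pure ASCII.
printf "bytes: %s\n", join ' ', map { sprintf '%02X', ord } split //, $bytes;
printf "utf-8: %d char(s), cp1252: %d char(s)\n",
    length $as_utf8, length $as_cp1252;
```

The reverse is not true: a byte string containing accented windows-1252 characters is rarely valid UTF-8, because UTF-8 lead bytes must be followed by continuation bytes in the range 0x80-0xBF.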

This is what I have so far. It seems to work, but I'm not sure how reliable it is:

### START OF CODE

open(TXT_FH, "< $insert_path") or die("Unable to open file $insert_path: $!");

# Attempt to read the file and decode the characters using
# various encoding schemes.  This is to work around the
# mess of formats and characters in the insert files.
my @encodings = ('utf-8', 'windows-1252', 'iso-8859-1');
my $raw_data = '';
my $utf8_txt = '';
my $enc_idx  = 0;
while( my $bytes = read(TXT_FH, my $buffer, 16) )
    {
    # buffer may end in a partial character so we append
    $raw_data .= $buffer;
    DECODE: while ($raw_data)
        {
        if ($enc_idx > $#encodings)
            {
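            # No encoding consumed the remaining bytes: decode them as
            # UTF-8, escaping malformed bytes as \xHH, and start over.
            $utf8_txt .= Encode::decode('utf-8', $raw_data, Encode::FB_PERLQQ);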
            $enc_idx = 0;
            last DECODE;
            }

        # With FB_QUIET, decode() stops at the first byte it cannot
        # handle and leaves the unprocessed remainder in $raw_data.
        $utf8_txt .= Encode::decode($encodings[$enc_idx], $raw_data, Encode::FB_QUIET);
        $enc_idx++ if ($raw_data);
        }
    }
close(TXT_FH);

### END OF CODE

Is there a better way to detect them? I need to make sure to interpret the encoding correctly, because later on I need to generate XML files with correct UTF-8.
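
For comparison, here is a rough sketch of the simplest alternative I can think of: decode the whole file contents in one go, try strict UTF-8 first, and fall back to windows-1252 only if that fails. The helper name decode_guess is hypothetical, and the fallback is an assumption about my files rather than a real detection method. Working on the whole file also avoids a multi-byte character being split across two 16-byte reads. I have left iso-8859-1 out because it differs from windows-1252 only in the 0x80-0x9F range.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper (not part of the script above): decode a whole
# byte string, preferring UTF-8 and falling back to windows-1252.
# Returns the decoded text and the name of the encoding that was used.
sub decode_guess {
    my ($octets) = @_;

    # FB_CROAK makes decode() die on the first malformed sequence.
    # decode() may modify its input when CHECK is set, so use a copy.
    my $copy = $octets;
    my $text = eval { decode('UTF-8', $copy, Encode::FB_CROAK) };
    return ($text, 'UTF-8') if defined $text;

    # Not valid UTF-8. Assumption: everything else is windows-1252.
    return (decode('windows-1252', $octets), 'windows-1252');
}

my ($text1, $enc1) = decode_guess("caf\xC3\xA9 noir");  # valid UTF-8
my ($text2, $enc2) = decode_guess("caf\xE9 noir");      # 0xE9 without continuation bytes
print "$enc1, $enc2\n";
```

Pure ASCII files decode identically either way, so reporting them as UTF-8 does no harm.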

_______________________________________________
Perl-Unix-Users mailing list
Perl-Unix-Users@listserv.ActiveState.com
To unsubscribe: http://listserv.ActiveState.com/mailman/mysubs
