Hello:
I need to automatically process the text of thousands of files with
simple regexp substitutions. The problem is that, although all the
files are plain text, they were written with a variety of programs on
Windows, so they use diverse encodings. For example, some are in
'utf-8', others in 'windows-1252', and some in 'latin-1'.
The script will run on Red Hat Linux, under Perl v5.8.5 (built for
x86_64-linux-thread-multi). I was in the process of whipping up a
script to run through these encodings (using Encode::decode) and find
the best one for each file, but I came up against an unfortunate
realization: a single two-byte Unicode character in UTF-8 looks
suspiciously like two single-byte ANSI (windows-1252) characters.
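For example, just to illustrate the ambiguity (the bytes below are
made up, not taken from my actual files):

    use Encode;

    # 0xC3 0xA9 is the UTF-8 encoding of the single character U+00E9
    # ("e" with acute accent), but the same two bytes are also perfectly
    # valid windows-1252, where they mean the two characters "Ã" and "©".
    my $bytes   = "\xC3\xA9";
    my $as_utf8 = Encode::decode('utf-8',        $bytes);  # 1 character
    my $as_1252 = Encode::decode('windows-1252', $bytes);  # 2 characters
    printf("utf-8: %d chars, windows-1252: %d chars\n",
           length($as_utf8), length($as_1252));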
This is what I have so far. It seems to work, but I'm not sure how
reliable it is:
### START OF CODE
open(TXT_FH, "< $insert_path") die("Unable to open file $insert_path:
$!");
# Attempt to read the file and decode the characters using
# various encoding schemes. This is to work around the
# mess of formats and characters in the insert files.
my @encodings = ('utf-8', 'windows-1252', 'iso-8859-1');
my $raw_data = '';
my $utf8_txt = '';
my $enc_idx = 0;
while( my $bytes = read(TXT_FH, my $buffer, 16) )
{
# buffer may end in a partial character so we append
$raw_data .= $buffer;
DECODE: while ($raw_data)
{
if ($enc_idx > $#encodings)
{
$utf8_txt .= Encode::decode('utf-8', $raw_data,
Encode::FB_PERLQQ);
$enc_idx = 0;
last DECODE;
}
# $data now contains the unprocessed partial character
$utf8_txt .= Encode::decode($encodings[$enc_idx], $raw_data,
Encode::FB_QUIET);
$enc_idx++ if ($raw_data);
}
}
close(TXT_FH);
### END OF CODE
Is there a better way to detect the encoding? I need to be sure I
interpret it correctly, because later on I have to generate XML files
in valid UTF-8.
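One alternative I have been toying with (just a sketch, and I may well
be missing something) is to slurp each file whole, try a strict UTF-8
decode first, and fall back to windows-1252 only if that croaks:

    use Encode;

    # Sketch: treat the whole file as UTF-8 if it decodes cleanly,
    # otherwise assume windows-1252.
    sub decode_file {
        my ($path) = @_;
        open(my $fh, "<", $path) or die("Unable to open file $path: $!");
        binmode($fh);
        my $bytes = do { local $/; <$fh> };
        close($fh);

        my $text = eval {
            my $copy = $bytes;   # FB_CROAK may modify the string it is given
            Encode::decode('utf-8', $copy, Encode::FB_CROAK);
        };
        return $text if defined $text;

        # Not valid UTF-8 as a whole; windows-1252 covers latin-1's
        # printable range plus the extra punctuation Windows editors use.
        return Encode::decode('windows-1252', $bytes);
    }

I also saw Encode::Guess in the Encode distribution, but I haven't
figured out whether it can resolve the utf-8 vs. windows-1252
ambiguity any better.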