Hi Greg et al,

Attached are my boilerplate routines for dealing with Unicode on Win32, along with a couple of samples.

Cheers,
Mark

---
Mark Leighton
CLIC LAN Supervisor, Information Commons, University of Toronto
E-mail: mark{DOT}leighton{AT}utoronto.ca



-------- Original Message  --------
Subject: Re: trouble understanding unicode
From: Gaurav Vaidya <gau...@ggvaidya.com>
To: gai...@visioninfosoft.com
Cc: perl-win32-users@listserv.activestate.com
Date: Thursday, March 26, 2009 11:14:04 PM

Hi Greg,

On Mar 27, 2009, at 6:47 AM, Greg Aiken wrote:
the problem here is that the 'msinfo.txt' file is not written in (single byte per character, ascii) format. instead the first two bytes of the file happen to be (hexFF)(hexFE). Beyond the first two bytes, each human readable ascii character is represented with TWO BYTES, (hex-ascii character value)(hex00)

(hexFF)(hexFE) is the Byte-Order Mark (http://en.wikipedia.org/wiki/Byte-order_mark ), so yes, definitely Unicode, and - if I'm reading the Wikipedia article correctly - definitely either little-endian UTF-16 or little-endian UTF-32.
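
To tell the candidates apart, the leading bytes can be compared against the known BOMs. The following is a minimal sketch, not part of the original thread; the helper name bom_encoding is hypothetical. It tests the four-byte BOMs first because the UTF-32LE BOM (FF FE 00 00) begins with the UTF-16LE BOM (FF FE). A UTF-16LE file whose first character is U+0000 would be misread as UTF-32LE, which is assumed not to occur in ordinary text files.

```perl
use strict;
use warnings;

# Hypothetical helper: name the encoding implied by a file's leading bytes.
# Returns undef (in scalar context) when no BOM is recognised.
sub bom_encoding {
    my ($head) = @_;    # first bytes of the file, read in :raw mode

    # Longest BOMs first, so FF FE 00 00 is not mistaken for UTF-16LE.
    my @table = (
        [ "\xFF\xFE\x00\x00", 'UTF-32LE' ],
        [ "\x00\x00\xFE\xFF", 'UTF-32BE' ],
        [ "\xEF\xBB\xBF",     'UTF-8'    ],
        [ "\xFF\xFE",         'UTF-16LE' ],
        [ "\xFE\xFF",         'UTF-16BE' ],
    );

    for my $entry (@table) {
        my ( $bom, $name ) = @$entry;
        return $name if index( $head, $bom ) == 0;
    }
    return;
}
```

For the file described above (FF FE followed by two bytes per character), this would report UTF-16LE.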

in addition, if anyone knows how to modify the following block so that I can effectively, read the records of this file, and convert the read record into ‘plain old ascii’ encoding – I would be most appreciative.

open (IN, 'infile.txt');
while ($rec = <IN>) {
    # convert $rec from its current encoding to simple ascii encoding   <<<<<<<<<< the magic code would go here
    print $rec;
}

Okay, here's my understanding of what's going on: unless you tell it otherwise, Perl 5.8 and above reads the file as raw bytes and does no decoding. But the file you're trying to open appears to be in UTF-16 or UTF-32 (you can use the table in the Wikipedia article above to figure out which one it is; two bytes per character after the BOM points to UTF-16). Searching at http://perldoc.perl.org/ brought me to http://perldoc.perl.org/Encode/Unicode.html , which seems to be Perl's way of handling Unicode which isn't UTF-8. Since it's part of the Encode module, you should be able to use:

    open(IN, '<:encoding(utf-32)', 'infile.txt') or die "Could not open 'infile.txt': $!";

to tell Perl to decode that file from UTF-32 into Perl's internal string format while reading (substitute utf-16 if that is what the file turns out to be). Similarly, to write out to this file without changing its UTF-16/32ishness, you can use:

    open(OUT, '>:encoding(utf-32)', 'outfile.txt') or die "Could not open 'outfile.txt' for writing: $!";

so Perl converts its internal strings into UTF-32 on output.
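
Putting this together for the read loop in the question, here is a minimal sketch under a few assumptions: the file is UTF-16 and starts with a BOM, and the helper name read_utf16_as_ascii is made up for illustration. The :raw layer is placed first so that a platform CRLF layer cannot interfere with the two-byte code units, and line endings are stripped by hand after decoding.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: read a BOM-prefixed UTF-16 file and return its
# lines as plain ASCII byte strings. Characters with no ASCII
# equivalent are replaced by Encode's substitution character.
sub read_utf16_as_ascii {
    my ($path) = @_;

    # :encoding(UTF-16) uses the BOM to choose the byte order and
    # removes it from the decoded data.
    open( my $in, '<:raw:encoding(UTF-16)', $path )
        or die "Could not open '$path': $!";

    my @lines;
    while ( my $rec = <$in> ) {
        $rec =~ s/\r?\n\z//;                   # drop a CRLF or LF line ending
        push @lines, encode( 'ascii', $rec );  # decoded characters -> ASCII bytes
    }
    close($in);

    return @lines;
}
```

Calling read_utf16_as_ascii('infile.txt') and printing each returned line followed by "\n" would give the plain ASCII output asked for. If the file has no BOM, name the byte order explicitly in the layer instead, for example :encoding(UTF-16LE).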

The Perl Cookbook backs me up on this [1] :-).

Once you've figured this out, let us know how you did it - I think it'll make a nice page for the Perl Win32 wiki (http://win32.perl.org/).

cheers,
Gaurav

[1] http://books.google.com/books?id=IzdJIax6J5oC&pg=PA335&lpg=PA335&dq=perl+opening+UTF-32&source=bl&ots=z6zl7q9efS&sig=HdQeMKL8NHjc5pi6gE5jAonqdCw&hl=en&ei=dEHMSeyOEZCw6wPtodCbBw&sa=X&oi=book_result&resnum=7&ct=result
_______________________________________________
Perl-Win32-Users mailing list
Perl-Win32-Users@listserv.ActiveState.com
To unsubscribe: http://listserv.ActiveState.com/mailman/mysubs
use strict;
use warnings;

use Encode;
use Carp;

# Get encoding filter for writing a utf8 file (with no Byte-Order-Marker).
my $ex1_encoding = SetFileEncoding( format => 'utf8' );
open( my $ex1_fh, '>' . $ex1_encoding, 'test-utf8.txt' )
    or die "Could not open 'test-utf8.txt' for writing: $!";

print( $ex1_fh "Hello World with utf-8\n" );

close( $ex1_fh );


# Get encoding filter for writing a UTF-8 file with a Byte-Order-Marker.
my ( $ex2_encoding, $ex2_bom ) = SetFileEncoding( format => 'utf8' );
open( my $ex2_fh, '>' . $ex2_encoding, 'test-utf8-bom.txt' )
    or die "Could not open 'test-utf8-bom.txt' for writing: $!";

print( $ex2_fh $ex2_bom );   # Write the correct BOM
print( $ex2_fh "Hello World with utf-8 and BOM\n" );

close( $ex2_fh );


# Get encoding filter for writing a UTF-16 file with a Byte-Order-Marker.
my ( $ex3_encoding, $ex3_bom ) = SetFileEncoding( format => 'UTF-16' );
open( my $ex3_fh, '>' . $ex3_encoding, 'test-utf16.txt' )
    or die "Could not open 'test-utf16.txt' for writing: $!";

print( $ex3_fh $ex3_bom );   # Write the correct BOM
print( $ex3_fh "Hello World with UTF-16\n" );

close( $ex3_fh );


# Get encoding filter for reading a utf8 file.

my $ex4_path = 'test-utf8.txt';

my ( $ex4_encoding, $ex4_bom ) = GetFileEncoding( path => $ex4_path );
open( my $ex4_fh, '<' . $ex4_encoding, $ex4_path )
    or die "Could not open '$ex4_path' for reading: $!";
SkipBOM( $ex4_fh, $ex4_bom );

while ( <$ex4_fh> ) { print };

close( $ex4_fh );

exit;




sub GetFileEncoding {
    my %arg = ( @_ );

    # Process minimally required arguments
    for my $required ( qw( path ) ) {
        croak message_req_param( 'parameter' => $required )
            unless exists $arg{$required};
    }

    my $encoding = '';
    my $bom = '';

    if ( open( my $fh, '<:raw', $arg{'path'} ) ) {
        my $header;
        if ( read( $fh, $header, 4, 0 ) == 4 ) {
            $header = unpack( 'N', $header );
            if ( ( $header & 0xffffff00 ) == 0xefbbbf00 ) {
                $encoding = ":encoding(utf8)";
                # $bom = "\x{feff}";
                $bom = pack( 'C3', 0xef, 0xbb, 0xbf );

            } elsif ( ( $header & 0xffffffff ) == 0xfffe0000 ) {
                $encoding = ":encoding(UTF-32LE)";
                # $bom = "\x{feff}";
                $bom = pack( 'C4', 0xff, 0xfe, 0x00, 0x00 );

            } elsif ( $header == 0x0000feff ) {
                # The UTF-32BE BOM is the byte sequence 00 00 FE FF.
                $encoding = ":encoding(UTF-32BE)";
                # $bom = "\x{feff}";
                $bom = pack( 'C4', 0x00, 0x00, 0xfe, 0xff );

            } elsif ( ( $header & 0xffff0000 ) == 0xfffe0000 ) {
                $encoding = ":encoding(UTF-16LE)";
                # $bom = "\x{feff}";
                $bom = pack( 'C2', 0xff, 0xfe );

            } elsif ( ( $header & 0xffff0000 ) == 0xfeff0000 ) {
                $encoding = ":encoding(UTF-16BE)";
                # $bom = "\x{feff}";
                $bom = pack( 'C2', 0xfe, 0xff );
            }
        }

        close( $fh );
    }

    return ( wantarray ? ( $encoding, $bom ) : $encoding );
}


sub SetFileEncoding {
    # Default arguments
    my %arg = ( 'format' => 'ASCII',
                @_ );

    # The UTF patterns are anchored so that 'UTF-16BE' and 'UTF-32BE' are not
    # also caught by the bare 'UTF-16' / 'UTF-32' patterns, which default to
    # little-endian (the usual byte order on Win32).
    $arg{'format'} = 'iso-8859-1'  if ( $arg{'format'} =~ /(ASCII|ANSI)/i );
    $arg{'format'} = 'utf8'        if ( $arg{'format'} =~ /^UTF-?8$/i );
    $arg{'format'} = 'UTF-16BE'    if ( $arg{'format'} =~ /^UTF-?16BE$/i );
    $arg{'format'} = 'UTF-16LE'    if ( $arg{'format'} =~ /^UTF-?16(?:LE)?$/i );
    $arg{'format'} = 'UTF-32BE'    if ( $arg{'format'} =~ /^UTF-?32BE$/i );
    $arg{'format'} = 'UTF-32LE'    if ( $arg{'format'} =~ /^UTF-?32(?:LE)?$/i );

    my $encoding = sprintf( ':raw:encoding(%s):crlf:utf8', $arg{'format'} );

    my $bom      = ( $arg{'format'} eq 'iso-8859-1' ? '' : "\x{feff}" );

    return ( wantarray ? ( $encoding, $bom ) : $encoding );
}


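sub SkipBOM {
    my ( $file_handle, $bom ) = @_;

    # $bom holds the raw BOM bytes found by GetFileEncoding, so its length
    # is the byte offset of the first real character (0 if there is no BOM).
    seek( $file_handle, length( $bom ), 0 );
}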


sub WriteBOM {
    my ( $file_handle, $bom ) = @_;

    print( $file_handle $bom );
}



sub message_req_param {
    my %arg = ( 'package'   => (caller(1))[0],
                'function'  => (caller(1))[3],
                'parameter' => 'unspecified parameter',
                @_ );

    return sprintf( "Error: %s() requires '%s' to be specified",
                    $arg{'function'}, $arg{'parameter'} );
}
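
A note on why SetFileEncoding can return the same "\x{feff}" as the BOM for every Unicode format: the output layer encodes U+FEFF like any other character, so the bytes on disk depend only on the encoding. The sketch below illustrates this with Encode's encode() function rather than a file handle; the helper name bom_bytes_hex is hypothetical and not part of the routines above.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: show, as space-separated hex, the bytes that
# U+FEFF becomes in the named encoding.
sub bom_bytes_hex {
    my ($encoding) = @_;
    my $bytes = encode( $encoding, "\x{feff}" );
    return uc join( ' ', unpack( '(H2)*', $bytes ) );
}

for my $encoding ( qw( utf8 UTF-16LE UTF-16BE UTF-32LE UTF-32BE ) ) {
    printf "%-8s %s\n", $encoding, bom_bytes_hex($encoding);
}
```

Only the endian-specific names are used here: asking Encode for plain 'UTF-16' or 'UTF-32' makes it prepend a BOM of its own, so writing "\x{feff}" as well would produce two.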
