Thanks Mark. I figured it out another way. Instead, I just open the file with
binmode set to raw and strip out the 0x00 bytes, and from there I can get
what I need just fine. It was a Windows platform, btw. Thanks for taking a
look.
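
For anyone finding this later, here is a rough sketch of the difference between
the two approaches. The sample strings are made up for illustration, and it
assumes the data is UTF-16LE, as it usually is on Windows. Stripping the NUL
bytes only recovers plain ASCII-range text; decoding handles anything.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Made-up sample data, encoded as UTF-16LE bytes (no BOM).
my $ascii_bytes = encode( 'UTF-16LE', "Version 1.0" );
my $euro_bytes  = encode( 'UTF-16LE', "\x{20ac}5" );    # euro sign, then "5"

# Approach 1: delete the 0x00 padding bytes.
( my $stripped_ascii = $ascii_bytes ) =~ tr/\x00//d;
( my $stripped_euro  = $euro_bytes )  =~ tr/\x00//d;

# Approach 2: decode the bytes into a Perl character string.
my $decoded_ascii = decode( 'UTF-16LE', $ascii_bytes );
my $decoded_euro  = decode( 'UTF-16LE', $euro_bytes );

# ASCII-range text survives both approaches.
print "stripped ok\n" if $stripped_ascii eq $decoded_ascii;

# The euro sign (U+20AC) has no 0x00 byte to strip, so approach 1 leaves
# its two raw bytes behind instead of one character.
printf "stripped length %d, decoded length %d\n",
    length($stripped_euro), length($decoded_euro);
```

So stripping is fine when the content is known to be plain ASCII, but a regex
that has to match anything beyond that needs decoded text.
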
Eric
"I'd take you seriously but to do so would be an affront to your intelligence."
-- William F. Buckley --
> Date: Wed, 2 Apr 2008 16:05:01 -0400
> From: [EMAIL PROTECTED]
> To: [email protected]
> Subject: Re: unicode encoding question
>
> Hi Eric,
>
> You really need to invoke the encoding IO layers on the open() call. Once
> you've done that, then input and output need nothing special to make them
> work. The trick is determining what kind of file you are dealing with and
> what set of filters to specify.
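
A minimal sketch of that idea, assuming the input is UTF-16 with a BOM. The
temporary file and its contents are hypothetical stand-ins for Ver.htm, and the
regex is only an example. With the layer on open(), each line arrives already
decoded, so an ordinary regex matches.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical stand-in for Ver.htm: a small UTF-16LE file with a BOM.
my ( $tmp, $path ) = tempfile( UNLINK => 1 );
binmode( $tmp, ':raw:encoding(UTF-16LE)' );
print {$tmp} "\x{feff}<b>Version 2.5</b>\n";
close($tmp) or die "Cannot close $path: $!";

# ':encoding(UTF-16)' expects a BOM, uses it to pick the byte order,
# and removes it from the decoded text.
open( my $in, '<:raw:encoding(UTF-16)', $path )
    or die "Cannot open $path: $!";

my @versions;
while ( my $line = <$in> ) {
    # $line holds characters, not padded bytes, so this matches normally.
    push @versions, $1 if $line =~ /Version\s+([\d.]+)/;
}
close($in);

print "Found version: $_\n" for @versions;
```

If the file has no BOM, ':encoding(UTF-16)' cannot guess the byte order, and an
explicit 'UTF-16LE' or 'UTF-16BE' is needed instead.
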
>
> These are the functions that I'm using regularly (on a Windows platform,
> since you didn't specify):
>
>
> #################################################
> #
> # get_file_encoding
> #
> # Determine how a file is encoded and return an encoding string for
> # correctly opening the file for reading.
> #
> # eg. my ( $encoding, $bom ) = get_file_encoding( $path );
> # open( my $fh, '<' . $encoding, $path ) or die;
> # skip_bom( $fh, $bom );
> #
>
> sub get_file_encoding {
>     my $path     = shift;
>     my $encoding = '';
>     my $bom      = '';
>
>     # Read the header in raw mode so no text-mode layer alters the bytes.
>     if ( open( my $file, '<:raw', $path ) ) {
>         my $header;
>         if ( read( $file, $header, 4, 0 ) == 4 ) {
>             my $magic = unpack( 'N', $header );
>             if ( ( $magic & 0xffffff00 ) == 0xefbbbf00 ) {
>                 $encoding = ":encoding(utf8)";
>                 $bom = pack( 'C3', 0xef, 0xbb, 0xbf );
>
>             } elsif ( $magic == 0xfffe0000 ) {
>                 # Checked before UTF-16LE, whose BOM is a prefix of this one.
>                 $encoding = ":encoding(UTF-32LE)";
>                 $bom = pack( 'C4', 0xff, 0xfe, 0x00, 0x00 );
>
>             } elsif ( $magic == 0x0000feff ) {
>                 # The UTF-32BE BOM is 00 00 FE FF.
>                 $encoding = ":encoding(UTF-32BE)";
>                 $bom = pack( 'C4', 0x00, 0x00, 0xfe, 0xff );
>
>             } elsif ( ( $magic & 0xffff0000 ) == 0xfffe0000 ) {
>                 $encoding = ":encoding(UTF-16LE)";
>                 $bom = pack( 'C2', 0xff, 0xfe );
>
>             } elsif ( ( $magic & 0xffff0000 ) == 0xfeff0000 ) {
>                 $encoding = ":encoding(UTF-16BE)";
>                 $bom = pack( 'C2', 0xfe, 0xff );
>             }
>         }
>
>         close( $file );
>     }
>
>     return ( wantarray ? ( $encoding, $bom ) : $encoding );
> }
>
>
>
> #################################################
> #
> # set_file_encoding
> #
> # Generate encoding and bom strings for use when correctly
> # opening UTF-8/Unicode files for writing.
> #
> # eg. my ( $encoding, $bom ) = set_file_encoding( 'UTF-16' );
> # open( my $fh, '>' . $encoding, $path ) or die;
> # write_bom( $fh, $bom );
> #
>
> sub set_file_encoding {
>     my $codepage = shift;
>     $codepage = 'utf8'       if ( uc($codepage) eq 'UTF-8' );
>     $codepage = 'utf8'       if ( uc($codepage) eq 'UTF8' );
>     $codepage = 'UTF-16LE'   if ( uc($codepage) eq 'UTF-16' );
>     $codepage = 'UTF-16LE'   if ( uc($codepage) eq 'UTF16' );
>     $codepage = 'UTF-16BE'   if ( uc($codepage) eq 'UTF16BE' );
>     $codepage = 'UTF-32LE'   if ( uc($codepage) eq 'UTF-32' );
>     $codepage = 'UTF-32LE'   if ( uc($codepage) eq 'UTF32' );
>     $codepage = 'UTF-32BE'   if ( uc($codepage) eq 'UTF32BE' );
>     $codepage = 'iso-8859-1' if ( uc($codepage) eq 'ASCII' );
>     $codepage = 'iso-8859-1' if ( uc($codepage) eq 'ANSI' );
>
>     my $encoding = sprintf( ':raw:encoding(%s):crlf:utf8', $codepage );
>
>     my $bom = '';
>     $bom = "\x{feff}" unless ( $codepage eq 'iso-8859-1' );
>
>     return ( wantarray ? ( $encoding, $bom ) : $encoding );
> }
>
>
>
> #################################################
> #
> # skip_bom
> #
> # Move the file pointer to start reading after any Byte-Order-Marker
> # detected by get_file_encoding().
> #
>
> sub skip_bom {
>     my ( $file_handle, $bom ) = @_;
>     seek( $file_handle, length( $bom ), 1 );
> }
>
>
>
> #################################################
> #
> # write_bom
> #
> # Write a Byte-Order-Marker to the given file handle.
> #
>
> sub write_bom {
>     my ( $file_handle, $bom ) = @_;
>     print( $file_handle $bom );
> }
>
>
> Cheers,
> Mark
>
> -------- Original Message --------
> Subject: unicode encoding question
> From: eric clark <[EMAIL PROTECTED]>
> To: [email protected]
> Date: Wednesday, April 02, 2008 2:48:24 PM
>
> > I have a file that I am attempting to parse that is in unicode. Here is
> > the code I am using:
> >
> > use Encode;
> >
> > $enc = find_encoding("ascii");
> >
> > open( OUTFP, ">output.txt" ) || die "Error opening output.txt: $!\n";
> > open( VER, "Ver.htm" ) || die "Error opening Ver.htm: $!\n";
> >
> > while( <VER> )
> > {
> > ## Regular expression goes here ##
> > $line = $enc->encode( $_ );
> > print OUTFP "$line\n\n\n";
> >
> > }
> >
> > close( VER );
> > close( OUTFP );
> >
> > I've tried using every encoding installed with that module, both decode
> > and encode and the output is always the same. Basically the file I am
> > reading is unicode, so all the characters are padded. I want this to be
> > either decoded into a normal text file, or at least be able to use the
> > regular expression. No matter what the expression always fails.
> >
> > Any ideas?
> >
> > Thanks,
> > Eric
> >
> > "I'd take you seriously but to do so would be an affront to your
> > intelligence."
> > -- William F. Buckley --
> >
> > _______________________________________________
> > Perl-Win32-Admin mailing list
> > [email protected]
> > To unsubscribe: http://listserv.ActiveState.com/mailman/mysubs