Thanks, Mark.  I figured it out another way: I just open the file in raw 
binmode and strip out the 0x00 bytes, and from there I can get what I need 
just fine.  It was a Windows platform, btw.  Thanks for taking a look.
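
A minimal sketch of that workaround, for anyone finding this thread later. It assumes the file is UTF-16LE holding only ASCII-range characters, so every second byte is 0x00. The helper name strip_nul_bytes is made up for illustration, and the file name Ver.htm comes from the original question. Anything outside ASCII would be mangled by this, in which case the encoding layers described in the quoted reply are the safer route.

```perl
use strict;
use warnings;

# Hypothetical helper: drop a leading UTF-16LE BOM (FF FE) and all 0x00
# bytes from a raw byte string.  Only safe when the text is ASCII-range
# characters stored as UTF-16LE.
sub strip_nul_bytes {
    my ($bytes) = @_;
    $bytes =~ s/\A\xFF\xFE//;   # remove the BOM, if present
    $bytes =~ tr/\x00//d;       # delete every NUL padding byte
    return $bytes;
}

# Usage sketch: slurp the file as raw bytes, then clean it up.
my $path = 'Ver.htm';
if ( open( my $fh, '<:raw', $path ) ) {
    local $/;                   # slurp mode: read the whole file at once
    my $text = strip_nul_bytes( scalar <$fh> );
    close( $fh );
    print $text;
}
```

After this step the text is plain single-byte data, so ordinary regular expressions match it as expected.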

Eric

"I'd take you seriously but to do so would be an affront to your intelligence."
-- William F. Buckley --



> Date: Wed, 2 Apr 2008 16:05:01 -0400
> From: [EMAIL PROTECTED]
> To: [email protected]
> Subject: Re: unicode encoding question
> 
> Hi Eric,
> 
> You really need to invoke the encoding IO layers on the open() call.  Once 
> you've done that, then input and output need nothing special to make them 
> work.  The trick is determining what kind of file you are dealing with and 
> what set of filters to specify.
> 
> These are the functions that I'm using regularly (on a Windows platform, 
> since you didn't specify):
> 
> 
> #################################################
> #
> # get_file_encoding
> #
> # Determine how a file is encoded and return an encoding string for
> # correctly opening the file for reading.
> #
> #  eg.   my ( $encoding, $bom ) = get_file_encoding( $path );
> #        open( my $fh, '<' . $encoding, $path ) or die;
> #        skip_bom( $fh, $bom );
> #
> 
> sub get_file_encoding {
>      my $path = shift;
>      my $encoding = '';
>      my $bom = '';
> 
>      # Read the first four bytes raw so that no IO layer alters them.
>      if ( open( my $file, '<:raw', $path ) ) {
>          my $header;
>          if ( read( $file, $header, 4, 0 ) == 4 ) {
>              # Treat the bytes as one big-endian 32-bit value for masking.
>              my $magic = unpack( 'N', $header );
>              if ( ( $magic & 0xffffff00 ) == 0xefbbbf00 ) {
>                  $encoding = ":encoding(utf8)";
>                  $bom = pack( 'C3', 0xef, 0xbb, 0xbf );
> 
>              # UTF-32LE must be tested before UTF-16LE (same first two bytes).
>              } elsif ( ( $magic & 0xffffffff ) == 0xfffe0000 ) {
>                  $encoding = ":encoding(UTF-32LE)";
>                  $bom = pack( 'C4', 0xff, 0xfe, 0x00, 0x00 );
> 
>              # The UTF-32BE BOM is the byte sequence 00 00 FE FF.
>              } elsif ( ( $magic & 0xffffffff ) == 0x0000feff ) {
>                  $encoding = ":encoding(UTF-32BE)";
>                  $bom = pack( 'C4', 0x00, 0x00, 0xfe, 0xff );
> 
>              } elsif ( ( $magic & 0xffff0000 ) == 0xfffe0000 ) {
>                  $encoding = ":encoding(UTF-16LE)";
>                  $bom = pack( 'C2', 0xff, 0xfe );
> 
>              } elsif ( ( $magic & 0xffff0000 ) == 0xfeff0000 ) {
>                  $encoding = ":encoding(UTF-16BE)";
>                  $bom = pack( 'C2', 0xfe, 0xff );
>              }
>          }
> 
>          close( $file );
>      }
> 
>      return ( wantarray ? ( $encoding, $bom ) : $encoding );
> }
> 
> 
> 
> #################################################
> #
> # set_file_encoding
> #
> # Generate encoding and bom strings for use when correctly
> # opening UTF-8/Unicode files for writing.
> #
> #  eg.   my ( $encoding, $bom ) = set_file_encoding( 'UTF-16' );
> #        open( my $fh, '>' . $encoding, $path ) or die;
> #        write_bom( $fh, $bom );
> #
> 
> sub set_file_encoding {
>      my $codepage = shift;
>      $codepage = 'utf8'        if ( uc($codepage) eq 'UTF-8' );
>      $codepage = 'utf8'        if ( uc($codepage) eq 'UTF8' );
>      $codepage = 'UTF-16LE'    if ( uc($codepage) eq 'UTF-16' );
>      $codepage = 'UTF-16LE'    if ( uc($codepage) eq 'UTF16' );
>      $codepage = 'UTF-16BE'    if ( uc($codepage) eq 'UTF16BE' );
>      $codepage = 'UTF-32LE'    if ( uc($codepage) eq 'UTF-32' );
>      $codepage = 'UTF-32LE'    if ( uc($codepage) eq 'UTF32' );
>      $codepage = 'UTF-32BE'    if ( uc($codepage) eq 'UTF32BE' );
>      $codepage = 'iso-8859-1'  if ( uc($codepage) eq 'ASCII' );
>      $codepage = 'iso-8859-1'  if ( uc($codepage) eq 'ANSI' );
> 
>      my $encoding = sprintf( ':raw:encoding(%s):crlf:utf8', $codepage );
> 
>      my $bom = '';
>      # Only Unicode encodings can represent U+FEFF, so only they get a BOM.
>      $bom = "\x{feff}"  if ( $codepage =~ /^(?:utf8|UTF-(?:16|32)(?:LE|BE))$/i );
> 
>      return ( wantarray ? ( $encoding, $bom ) : $encoding );
> }
> 
> 
> 
> #################################################
> #
> # skip_bom
> #
> # Move the file pointer to start reading after any Byte-Order-Marker
> # detected by get_file_encoding().
> #
> 
> sub skip_bom {
>      my ( $file_handle, $bom ) = @_;
>      seek( $file_handle, length( $bom ), 1 );
> }
> 
> 
> 
> #################################################
> #
> # write_bom
> #
> # Write a Byte-Order-Marker to the given file handle.
> #
> 
> sub write_bom {
>      my ( $file_handle, $bom ) = @_;
>      print( $file_handle $bom );
> }
> 
> 
> Cheers,
> Mark
> 
> -------- Original Message  --------
> Subject: unicode encoding question
> From: eric clark <[EMAIL PROTECTED]>
> To: [email protected]
> Date: Wednesday, April 02, 2008 2:48:24 PM
> 
> > I have a file that I am attempting to parse that is in unicode.  Here is 
> > the code I am using:
> > 
> > use Encode;
> > 
> > $enc = find_encoding("ascii");
> > 
> > open( OUTFP, ">output.txt" ) || die "Error opening output.txt: $!\n";
> > open( VER, "Ver.htm" ) || die "Error opening Ver.htm: $!\n";
> > 
> > while( <VER> )
> > {
> >     ## Regular expression goes here ##
> >     $line = $enc->encode( $_ );
> >     print OUTFP "$line\n\n\n";
> > 
> > }
> > 
> > close( VER );
> > close( OUTFP );
> > 
> > I've tried using every encoding installed with that module, both decode 
> > and encode, and the output is always the same.  Basically the file I am 
> > reading is unicode, so all the characters are padded.  I want this to be 
> > either decoded into a normal text file, or at least be able to use the 
> > regular expression.  No matter what, the expression always fails.
> > 
> > Any ideas?
> > 
> > Thanks,
> >     Eric
> > 
> > "I'd take you seriously but to do so would be an affront to your 
> > intelligence."
> > -- William F. Buckley --
> > 
> > _______________________________________________
> > Perl-Win32-Admin mailing list
> > [email protected]
> > To unsubscribe: http://listserv.ActiveState.com/mailman/mysubs
