"Dan Kogai" <[EMAIL PROTECTED]> wrote in message
news:[EMAIL PROTECTED]
> On Sep 15, 2005, at 07:05 , Steve Larson wrote:
>
> > What I want to do is add a version string comment at the beginning
> > of .xml files.  I test to see if the file is UNICODE
> > (Encode::Unicode) or ASCII (Encode::XS) using guess_encoding.  My
> > ASCII case works fine but the regexp for the UNICODE case fails.
> > Below snippet is the code for the UNICODE case.
>
> The answer is that PerlIO does not go well with BOMed UTFs.  What you
> should do instead is to read the whole file first like this;
>
> open my $in, "<:raw", $filename or die "$filename : $!";
> read $in, my $buffer, -s $filename; # one of many ways to slurp file.
> close $in;
> my $content = decode("UTF16", $buffer); # LE or BE is not required.
> #
> # do whatever you want to $content and....
> #
> open my $out, ">:raw", $filename or die "$filename : $!";
> print $out encode("UTF16-LE", $content); # now be explicit on endianness
> close $out;
>
> Remember UTF-(16|32) does not go well with stream models.  Treat it
> as a binary file.
>
> Dan the Encode Maintainer
>
Thanks Dan.
I still get no BOM when using UTF-16LE for output (using UTF16 I get a BOM
and BE output).  I need UTF-16LE byte order with a BOM, just like the input,
whenever the input has a BOM.  I also get 0x0A for \n instead of the
0x0D 0x0A I should be getting.
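
The output behaviour described above can be reproduced with a short script.
This is only a sketch: $text is a made-up sample string and hex_bytes is a
hypothetical helper for display.  It relies on Encode's documented rule
that "UTF-16" on encode emits a BOM plus big-endian data, while "UTF-16LE"
emits no BOM, so one way to get a little-endian BOM is to prepend U+FEFF
by hand.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: show a byte string as upper-case hex.
sub hex_bytes { return uc unpack("H*", $_[0]) }

# Made-up sample content.  "\r\n" is spelled out because neither Encode
# nor a :raw filehandle translates "\n" into CR LF.
my $text = "<a/>\r\n";

# "UTF-16" on encode: BOM (FE FF) followed by big-endian code units.
my $be_with_bom = encode("UTF-16", $text);

# "UTF-16LE" on encode: little-endian code units, no BOM.
my $le_no_bom = encode("UTF-16LE", $text);

# Prepending U+FEFF makes the output start with FF FE (a little-endian BOM).
my $le_with_bom = encode("UTF-16LE", "\x{FEFF}" . $text);

printf "%-10s %s\n", "UTF-16",   hex_bytes($be_with_bom);
printf "%-10s %s\n", "UTF-16LE", hex_bytes($le_no_bom);
printf "%-10s %s\n", "LE + BOM", hex_bytes($le_with_bom);
```

Writing "\r\n" explicitly (or converting "\n" before encoding) is one way to
keep CR LF line endings when the file is treated as binary.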

When I run across a file without a BOM with "...decode(UTF16,$buffer)", I
get an error "UTF-16:Unrecognised BOM 3c00 at..." on reading the file in.
So UTF16 is apparently looking for a BOM.
With "...decode(UTF-16LE,$buffer)" and "$content =~
s/(\x{fffe}(<\?.*?\?>)*)\n?/\n<!-- Build Ver..." I am able to read in files
with or without a BOM.  However, I get a warning "Unicode character 0xfffe
is illegal at ..." and the BOM (when it exists) does not stay at the
beginning of the file.
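
One possible way to handle files both with and without a BOM is to inspect
the first two raw bytes before decoding.  The sketch below is not taken
from the Encode docs; decode_utf16_bytes and encode_utf16_text are
hypothetical helper names, and treating BOM-less input as little-endian is
an assumption (it matches the "3c00" in the error message above).  Note
that a decoded BOM is the character U+FEFF; U+FFFE is a noncharacter, which
would be consistent with the warning quoted above.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: turn raw UTF-16 bytes into (text, encoding, had_bom).
# The BOM bytes are stripped, so the returned text never starts with U+FEFF.
sub decode_utf16_bytes {
    my ($bytes) = @_;
    my ($enc, $had_bom) = ("UTF-16LE", 0);    # assumption: no BOM means LE
    if    ($bytes =~ s/\A\xFF\xFE//) { ($enc, $had_bom) = ("UTF-16LE", 1) }
    elsif ($bytes =~ s/\A\xFE\xFF//) { ($enc, $had_bom) = ("UTF-16BE", 1) }
    return (decode($enc, $bytes), $enc, $had_bom);
}

# Hypothetical helper: encode text again, restoring the BOM if there was one.
sub encode_utf16_text {
    my ($text, $enc, $had_bom) = @_;
    $text = "\x{FEFF}" . $text if $had_bom;
    return encode($enc, $text);
}

# Round trip on made-up sample data: little-endian input with a BOM.
my $input = "\xFF\xFE" . encode("UTF-16LE", "<a/>\r\n");
my ($text, $enc, $had_bom) = decode_utf16_bytes($input);
my $output = encode_utf16_text($text, $enc, $had_bom);
print $output eq $input ? "round trip ok\n" : "round trip differs\n";
```

Because the BOM is removed before the text is edited and added back only on
output, a substitution on the decoded text does not need to match it at all.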

It still seems like I am not understanding something (maybe something basic)
about processing UTF files.  I have read through the related docs in the help
several times, and in several cases the behavior seems to be the opposite of
what they describe.
Any suggestions?

Steve

