"Dan Kogai" <[EMAIL PROTECTED]> wrote in message news:[EMAIL PROTECTED] > On Sep 15, 2005, at 07:05 , Steve Larson wrote: > > > What I want to do is add a version string comment at the beginning > > of .xml > > files. I test to see if the file is UNICODE (Encode::Unicode) or > > ASCII > > (Encode::XS) using guess_encoding. My ASCII case works fine but > > the regexp > > for the UNICODE case fails. Below snippet is the code for the > > UNICODE case. > > The answer is that PerlIO does not go well with BOMed UTFs. What you > should do instead is to read the whole file first like this; > > open my $in, "<:raw", $filename or die "$filename : $!"; > read $in, my $buf, -s $filename; # one of many ways to slurp file. > close $in; > my $content = decode("UTF16", $buffer); # LE or BE is not required. > # > # do whatever you want to $content and.... > # > open my $out, ":>raw", $filename or die "$filename : $!"; > print $out encode("UTF16-LE", $buffer); # now be explicit on endianness > close $out; > > Remember UTF-(16|32) does not go well with stream models. Treat it > as a binary file. > > Dan the Encode Maintainer > Thanks Dan. I still get no BOM when using UTF-16LE for output (using UTF16 I get a BOM and BE output). I need UTF-16LE byte order with a BOM just like the input when the input has a BOM. I also get 0x0A for \n instead of the 0x0A 0x0D I should be getting.
When I run across a file without a BOM with "...decode(UTF16,$buffer)", I get an error "UTF-16:Unrecognised BOM 3c00 at..." on reading the file in. So UTF16 is apparently looking for a BOM.

With "...decode(UTF-16LE,$buffer)" and "$content =~ s/(\x{fffe}(<\?.*?\?>)*)\n?/\n<!-- Build Ver..." I am able to read in files with or without a BOM. However, I get a warning "Unicode character 0xfffe is illegal at ..." and the BOM (when it exists) does not stay at the beginning of the file.

It still seems like I am not understanding something (maybe basic) about processing UTF files. I have read through the related docs in the help several times and the behavior seems to be the opposite of what they describe in several cases. Any suggestions?

Steve
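
A note on the warning above: the bytes FF FE at the start of a little-endian file decode under UTF-16LE to the character U+FEFF, not U+FFFE. U+FFFE is a Unicode noncharacter, which is likely why Perl complains about \x{fffe} in the pattern. A minimal sketch of one way to handle files with or without a BOM is to check for it at the byte level before decoding. The helper name decode_utf16_maybe_bom is hypothetical, and treating BOM-less input as little-endian is an assumption that may not hold for every file.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper (not part of Encode): decodes UTF-16 bytes and
# returns (text without BOM, flag telling whether a BOM was present).
sub decode_utf16_maybe_bom {
    my ($bytes, $default) = @_;
    $default ||= "UTF-16LE";    # assumption: BOM-less files are little-endian

    # FF FE marks little-endian; strip it and decode the rest.
    if ($bytes =~ s/\A\xFF\xFE//) {
        return (decode("UTF-16LE", $bytes), 1);
    }
    # FE FF marks big-endian.
    if ($bytes =~ s/\A\xFE\xFF//) {
        return (decode("UTF-16BE", $bytes), 1);
    }
    # No BOM: fall back to the assumed byte order.
    return (decode($default, $bytes), 0);
}

my ($text, $had_bom) = decode_utf16_maybe_bom("\xFF\xFE<\x00?\x00");
print $had_bom ? "BOM found\n" : "no BOM\n";
```

Because the BOM is removed before decoding, a substitution on the decoded text no longer has to match it, and the flag records whether a BOM should be written back on output.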