UTF-16LE fails in substitution

Steve Larson Thu, 15 Sep 2005 04:12:41 -0700

What I want to do is add a version string comment at the beginning of .xml
files.  I test whether the file is UNICODE (Encode::Unicode) or ASCII
(Encode::XS) using guess_encoding.  My ASCII case works fine, but the regexp
for the UNICODE case fails.  The snippet below is the code for the UNICODE case.




 1  # RE-read file to be updated using UNICODE
 2  open VERSIONEDFILE, "<:encoding(UTF-16LE)", "$working_file_list[0]"
 3      or warn "versionfiles : warning : $working_file_list[0]--Unable to set version in file\n";
 4  $filecontents = "";
 5  @filecontents = <VERSIONEDFILE>
        or warn "versionfiles : warning : $working_file_list[0]--Unable to set version in file because of possible Unrecognized BOM\n";
 6  $filecontents = join '', @filecontents; # pull array into a single string so it can be parsed
 7  close VERSIONEDFILE;
 8
 9  # write updates to temporary file
10  open VERSIONEDFILE, ">:encoding(UTF-16LE)", "$folder//_temp_file"
        or warn "versionfiles : warning : $working_file_list[0]--Unable to set version in file\n";
11  # place <!-- Build Version: $build_number --> after
12  # <?  ?> delimited comments that are supposed to be at the top of the file
13  # \xFF\xFE is the BOM for UTF-16LE
14  $filecontents =~ s/(\x{fffe}(<\?.*?\?>)*)\n?/\n<!-- Build Version: $build_number -->\n/s
        or warn "versionfiles : warning : $working_file_list[0]--Unexpected format. Unable to parse to set version.\n";
15  print VERSIONEDFILE $filecontents;
16  close VERSIONEDFILE;
17
18  # replace original file
19  system ("attrib -R \"$working_file_list[0]\"");
20  rename "$folder\\_temp_file", $working_file_list[0]
        or warn "versionfiles : warning : $working_file_list[0]--Unable to set version in file\n";





Problems I have:

1. The expression at line 14 fails as is and prints the warn message.

2. I can get the expression at line 14 to succeed if I change from UTF-16LE
to UTF-16BE on lines 2 and 10, but then each \n in the substitution comes out
as 0x00 0x0D 0x0A (Null, Carriage Return, Line Feed).  This makes the CRLF
appear as garbage and throws off the characters after it.  (If lines 15 and 16
are moved before line 14, the output looks the same, with the UTF-16LE BOM,
whether UTF-16LE or UTF-16BE is used.  However, using UTF-16BE throws the
warning "Unicode character 0xfffe is illegal at [script name and line]" at
line 5.)

3. Without \x{fffe} in the expression, the output no longer starts with a
BOM: the string is inserted at the front of the file, followed by the
original BOM and the rest of the file, and \n becomes 0x0D 0x0A 0x00
(Carriage Return, Line Feed, Null).  (I sort of expected Perl to keep the
correct BOM in place, but I can deal with this part.)
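
As a rough illustration of the byte patterns in items 2 and 3, here is a
minimal sketch that encodes a newline and then rewrites LF bytes to CR LF at
the byte level.  It assumes the core Encode module; hex_dump and
simulate_byte_level_crlf are hypothetical helpers, and the idea that a
text-mode CRLF translation runs after the UTF-16 encoding is an assumption
about the cause, not a confirmed diagnosis.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Render a byte string as space-separated hex, e.g. "0D 0A 00".
sub hex_dump {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02X', ord($_) } split //, $bytes;
}

# ASSUMPTION: simulate a text-mode layer that rewrites every LF byte to
# CR LF after the characters have already been encoded to UTF-16.
sub simulate_byte_level_crlf {
    my ($bytes) = @_;
    (my $out = $bytes) =~ s/\x0A/\x0D\x0A/g;
    return $out;
}

# "\n" encoded first, then translated byte by byte.
my $le = simulate_byte_level_crlf(encode('UTF-16LE', "\n"));
my $be = simulate_byte_level_crlf(encode('UTF-16BE', "\n"));

# For comparison: CR and LF each encoded as a full 16-bit unit.
my $proper = encode('UTF-16LE', "\r\n");

print 'UTF-16LE, LF rewritten per byte: ', hex_dump($le),     "\n"; # 0D 0A 00
print 'UTF-16BE, LF rewritten per byte: ', hex_dump($be),     "\n"; # 00 0D 0A
print 'UTF-16LE, CRLF as characters:    ', hex_dump($proper), "\n"; # 0D 00 0A 00
```

The first two results match the byte sequences reported in items 3 and 2
respectively.  If the assumption holds, a commonly suggested layer order is
">:raw:encoding(UTF-16LE):crlf", so that the CRLF translation happens on
characters rather than bytes; that layer string is untested on this build.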



Why would the substitution fail as written?

How do I get the CRLF to use only the two bytes it is supposed to?

Is there a better way to do the task?



Steve



Environment:

ActiveState Perl v5.8.7 built for MSWin32-x86-multi-thread, Build 813,
Compiled at Jun  6 2005 13:36:37

Input files are UTF-16LE with a BOM of 0xFF 0xFE.



Basis for my assumptions

Based on Perl\html\lib\Pod\perlunicode.html in the distribution,

"Because of the 16-bitness, UTF-16 is byte-order dependent. UTF-16 itself
can be used for in-memory computations, but if storage or transfer is
required either UTF-16BE (big-endian) or UTF-16LE (little-endian) encodings
must be chosen.



"This introduces another problem: what if you just know that your data is
UTF-16, but you don't know which endianness? Byte Order Marks, or BOMs, are
a solution to this. A special character has been reserved in Unicode to
function as a byte order marker: the character with the code point U+FEFF is
the BOM.



"The trick is that if you read a BOM, you will know the byte order, since if
it was written on a big-endian platform, you will read the bytes 0xFE 0xFF,
but if it was written on a little-endian platform, you will read the bytes
0xFF 0xFE."
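
As a small check of the quoted passage, the following sketch decodes a
buffer that starts with the bytes 0xFF 0xFE as UTF-16LE and looks at which
character comes out first.  It assumes the core Encode module; the sample
bytes and variable names are made up for illustration and are not taken from
the real input files.

```perl
use strict;
use warnings;
no warnings 'utf8';   # older Perls may warn about U+FFFE in the pattern below
use Encode qw(decode);

# Hypothetical sample: the bytes FF FE followed by "<?xml?>" in UTF-16LE.
my $bytes = "\xFF\xFE<\x00?\x00x\x00m\x00l\x00?\x00>\x00";

# Decode the raw bytes into Perl characters, as :encoding(UTF-16LE) would.
my $text = decode('UTF-16LE', $bytes);

# After decoding, the two BOM bytes are a single character.
my $first = ord substr($text, 0, 1);
printf "first character: U+%04X\n", $first;   # U+FEFF

# Compare the two spellings of the BOM in a pattern.
my $matches_feff = ($text =~ /^\x{FEFF}/) ? 1 : 0;
my $matches_fffe = ($text =~ /^\x{FFFE}/) ? 1 : 0;
print "matches \\x{FEFF}: $matches_feff\n";   # 1
print "matches \\x{FFFE}: $matches_fffe\n";   # 0
```

In this sketch the decoded string begins with the code point U+FEFF, so a
pattern written against the decoded text sees that character rather than the
on-disk byte order 0xFF 0xFE.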
