I want to add a version-string comment at the beginning of .xml files. I use guess_encoding to test whether the file is UNICODE (Encode::Unicode) or ASCII (Encode::XS). My ASCII case works fine, but the regexp for the UNICODE case fails. The snippet below is the code for the UNICODE case.
 1  # RE-read file to be updated using UNICODE
 2  open VERSIONEDFILE, "<:encoding(UTF-16LE)", "$working_file_list[0]" or warn "versionfiles : warning : $working_file_list[0]--Unable to set version
 3  in file\n";
 4  $filecontents = "";
 5  @filecontents = <VERSIONEDFILE> or warn "versionfiles : warning : $working_file_list[0]--Unable to set version in file because of possible Unrecognized BOM\n";
 6  $filecontents = join '', @filecontents; # pull array into a single string so it can be parsed
 7  close VERSIONEDFILE;
 8
 9  # write updates to temporary file
10  open VERSIONEDFILE, ">:encoding(UTF-16LE)", "$folder//_temp_file" or warn "versionfiles : warning : $working_file_list[0]--Unable to set version in file\n";
11  # place <!-- Build Version: $build_number --> after
12  # <? ?> delimited comments that are supposed to be at the top of the file
13  # \xFF\xFE is the BOM for UTF-16LE
14  $filecontents =~ s/(\x{fffe}(<\?.*?\?>)*)\n?/\n<!-- Build Version: $build_number -->\n/s or warn "versionfiles : warning : $working_file_list[0]--Unexpected format. Unable to parse to set version.\n";
15  print VERSIONEDFILE $filecontents;
16  close VERSIONEDFILE;
17
18  # replace original file
19  system ("attrib -R \"$working_file_list[0]\"");
20  rename "$folder\\_temp_file", $working_file_list[0] or warn "versionfiles : warning : $working_file_list[0]--Unable to set version in file\n";

Problems I have:

1. The expression at line 14 fails as is and prints the warn message.

2. I can get the expression at line 14 to succeed if I change from UTF-16LE to UTF-16BE on lines 2 and 10, but with \n in the substitution I get 0x00 0x0D 0x0A (Null, Carriage Return, Line Feed) in its place. This makes the CRLF appear as garbage and throws off the characters after it. (If lines 15 and 16 are moved before line 14, the output looks the same, with the UTF-16LE BOM, whether UTF-16LE or UTF-16BE is used. However, using UTF-16BE throws a warning "Unicode character 0xfffe is illegal at [script name and line]" at line 5.)

3. Without \x{fffe} in the expression I get no BOM at the beginning of the file: the string is inserted at the front of the file, followed by the original BOM and the rest of the file, and \n then becomes 0x0D 0x0A 0x00 (Carriage Return, Line Feed, Null). (I sort of expected Perl to keep the correct BOM in place, but I can deal with this part.)

Why would the substitution fail as written? How do I get the CRLF to use only the two bytes it is supposed to? Is there a better way to do the task?

Steve

Environment:
ActiveState Perl v5.8.7 built for MSWin32-x86-multi-thread, Build 813, Compiled at Jun 6 2005 13:36:37
Input files are UTF-16LE with a BOM of 0xFF 0xFE.

Basis for my assumptions

Based on Perl\html\lib\Pod\perlunicode.html in the distribution:

"Because of the 16-bitness, UTF-16 is byte-order dependent. UTF-16 itself can be used for in-memory computations, but if storage or transfer is required either UTF-16BE (big-endian) or UTF-16LE (little-endian) encodings must be chosen.

"This introduces another problem: what if you just know that your data is UTF-16, but you don't know which endianness? Byte Order Marks, or BOMs, are a solution to this. A special character has been reserved in Unicode to function as a byte order marker: the character with the code point U+FEFF is the BOM.

"The trick is that if you read a BOM, you will know the byte order, since if it was written on a big-endian platform, you will read the bytes 0xFE 0xFF, but if it was written on a little-endian platform, you will read the bytes 0xFF 0xFE."
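
To make the quoted passage concrete, here is a minimal sketch (not part of the script above) that prints the bytes the BOM character U+FEFF turns into under each byte order, and the character that comes back when the little-endian bytes are decoded with an explicit UTF-16LE. The variable names are illustrative only, and it assumes the Encode module that ships with Perl 5.8.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# The BOM is one character, U+FEFF; only its byte layout differs.
my $bom      = "\x{FEFF}";
my $le_bytes = encode('UTF-16LE', $bom);   # little-endian byte pair
my $be_bytes = encode('UTF-16BE', $bom);   # big-endian byte pair

# Helper (hypothetical name): render a byte string as hex pairs.
sub hex_bytes { return join ' ', map { sprintf '%02X', ord($_) } split //, $_[0] }

print "UTF-16LE: ", hex_bytes($le_bytes), "\n";
print "UTF-16BE: ", hex_bytes($be_bytes), "\n";

# Decode the bytes FF FE 3C 00 (BOM followed by "<") as UTF-16LE.
# With an explicit LE/BE encoding name, Encode treats the BOM as an
# ordinary character rather than stripping it.
my $text = decode('UTF-16LE', "\xFF\xFE<\x00");
printf "first character after decoding: U+%04X\n", ord($text);
```

The point of the last line is that, once the file has been decoded, a pattern is matched against characters rather than against the raw bytes on disk, so the value printed there is what a regexp would see at the start of the string.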