Hi. I found this DL via the perldoc.perl.org/perluniintro page. If I'm violating protocol by writing directly, please pardon me.
I have 2 data files I want to compare: one is in UTF-16BE (Windows "Unicode" format) and one is in UTF-8. I wrote 3 Perl programs:

*) 1 to normalize the data in the UTF-16BE source and write it to a UTF-8 output file
*) 1 to normalize the data in the UTF-8 source and write it to a UTF-8 output file
*) 1 to do a string comparison of the 2 output files and write 3 files: "common items from both files", "items unique to UTF-16BE source", and "items unique to UTF-8 source"

I noticed that the UTF-16BE -> UTF-8 conversion works fine except for a very few characters, specifically the Right-Quote (U+2019): http://www.fileformat.info/info/unicode/char/2019/index.htm

It appears in the source UTF-16BE file in character streams such as "...owe's...", where the ' stands for the character above, not the ASCII apostrophe I have typed here. The problem, it seems to me, is that when decode() sees it, it merges the "'s" into some other bizarre characters, and I have to do this replacement BEFORE decode() to avoid the problem:

    $char_inline =~ s/\x19\xE2\x81\xB3/\xE2\x80\x99\x73/;

I've tried the Unicode::Normalize routines, sometimes before and sometimes after decode(), to test every order that might yield the right result, to no avail.

While this fails via decode(), if I run "iconv -f UTF-16 -t UTF-8" on Solaris 9, the resulting output file is in UTF-8 and has the correct Right-Quote character. This makes me think that decode(), or Perl's internal encoding conversion, is incomplete or in error for at least some conversions between the various Unicode encodings. Since it would appear that the normalization routines only really have value **after** the decode() conversion from some-random-encoding -> UTF-8, it would be great if there were a way to ensure that the initial conversion is always correct and complete.

With the exception of iconv, I ran Perl on Windows, so perhaps there is a problem only with the Windows port?
Otherwise:
1) Please be aware of this error.
2) Any suggestions (other than pre-translating via "iconv" ;-)?

Thanks!
-NICK
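
For reference, here is a minimal sketch of the decode step described above. It is not the actual program: it assumes the input really is BOM-less UTF-16BE, and it uses a hypothetical in-memory sample ("owe" + U+2019 + "s") in place of the source file. The variable names are made up for illustration. With Encode's decode() and encode(), the Right-Quote should come out as the UTF-8 bytes E2 80 99, followed by 73 for the "s".

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical sample: "owe" + U+2019 + "s" as raw UTF-16BE bytes
# (two bytes per character, big-endian, no BOM).
my $utf16be_bytes = pack('H*', '006F0077006520190073');

# decode() turns the raw bytes into a string of Perl characters;
# encode() turns those characters into UTF-8 bytes.
my $chars      = decode('UTF-16BE', $utf16be_bytes);
my $utf8_bytes = encode('UTF-8', $chars);

# Show the resulting UTF-8 bytes in hex so they can be checked by eye.
my $utf8_hex = join ' ', map { sprintf('%02X', ord($_)) } split(//, $utf8_bytes);
print "$utf8_hex\n";   # 6F 77 65 E2 80 99 73
```

If real data produces different bytes for this sequence, comparing a hex dump of the raw input with the sample above may help show whether the file's byte order, or the way it is being read, differs from what is assumed here. When reading through a file handle rather than a string, a layer such as '<:raw:encoding(UTF-16BE)' is the usual equivalent of the decode() call.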