On 2002.02.01, at 19:24, Nick Ing-Simmons wrote:
> As part of the mystery of CJK encodings I notice that IBM's ICU's uconv
> and SuSE6.4 linux iconv differ as to the UTF-8 representation if
> table.euc
>
> Both converters will round-trip with themselves and give byte exact
> copy of table.euc
>
> Weirdly they differ in how they map '\' and '~' in ASCII space as
> well as some spots in higher characters.

Oh, yes. This is the problem with the original Unicode 2.x map: it is not
ASCII-preserving. I posted this problem to perl- [EMAIL PROTECTED] when I
first released Jcode. Several discussions later, I made Jcode preserve
ASCII by default and added $Jcode::Unicode::PEDANTIC to change the
behavior. Here is the excerpt from Jcode::Unicode:

> VARIABLES
> $Jcode::Unicode::PEDANTIC
> When set to non-zero, x-to-unicode conversion becomes
> pedantic. That is, '\' (chr(0x5c)) is converted to
> zenkaku backslash and '~' (chr(0x7e)) to JIS-x0212
> tilde.
>
> By Default, Jcode::Unicode leaves ascii ([0x00-0x7f])
> as it is.

> Linux iconv will not take ICU's UTF-8.
> ICU's uconv will read the iconv output but does produce same as original
> table.euc.

So far as I can see, Linux iconv is ASCII-preserving while ICU's is
Unicode-strict. From Perl's point of view, ASCII-preserving should be the
default.

FYI, I have reported this brain-dead mapping problem to the Unicode
Consortium but never got an answer. Well, they are not a public society, in
the sense that they charge for membership before you can say anything. One
of the reasons so many Japanese love to hate Unicode...

> Our current euc-jp.ucm is compatible with Linux iconv.

Right choice.

Dan the Man with So Many Charsets to Deal With
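
P.S. For anyone who wants to see which ASCII bytes a given converter leaves alone, here is a rough sketch in Python (not Jcode, and not what iconv or uconv do internally). It decodes each byte in [0x00-0x7f] on its own and reports the ones that do not come back as the same code point. The helper names are made up for this example, and the "pedantic" decoder is only a stand-in: the two target code points in PEDANTIC_OVERRIDES are my assumption for illustration, not the actual values from any mapping table.

```python
def non_preserved_ascii(decode):
    """Return {byte: decoded_text} for ASCII bytes that `decode` changes."""
    changed = {}
    for b in range(0x80):
        out = decode(bytes([b]))
        if out != chr(b):
            changed[b] = out
    return changed


def ascii_preserving_decode(data):
    # Python's built-in euc_jp codec passes 0x00-0x7f through unchanged.
    return data.decode("euc_jp")


# Hypothetical overrides, for illustration only: remap '\' and '~' to
# full-width forms to imitate a "pedantic" converter.
PEDANTIC_OVERRIDES = {0x5C: "\uFF3C", 0x7E: "\uFF5E"}


def pedantic_decode(data):
    # Stand-in for a strict converter: decode, then apply the overrides.
    return data.decode("euc_jp").translate(PEDANTIC_OVERRIDES)


if __name__ == "__main__":
    for name, fn in [("preserving", ascii_preserving_decode),
                     ("pedantic", pedantic_decode)]:
        diffs = non_preserved_ascii(fn)
        print(name, [hex(b) for b in sorted(diffs)])
```

Pointing the same check at the real output of two converters should show whether they disagree only at 0x5c and 0x7e or elsewhere in the ASCII range as well.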