On 2002.02.01, at 19:24, Nick Ing-Simmons wrote:
> As part of the mystery of CJK encodings I notice that IBM's ICU's uconv
> and SuSE6.4 linux iconv differ as to the UTF-8 representation of
> table.euc
>
> Both converters will round-trip with themselves and give byte exact
> copy of table.euc
>
> Weirdly they differ in how they map '\' and '~' in ASCII space as
> well as some spots in higher characters.

   Oh, yes.  This is a problem with the original Unicode 2.x map: it is
not ASCII-preserving.  I posted this problem to perl-[EMAIL PROTECTED]
when I first released Jcode.  Several discussions later, I made Jcode
preserve ASCII by default and added $Jcode::Unicode::PEDANTIC to change
the behavior.
   Here is the excerpt from Jcode::Unicode:

> VARIABLES
>        $Jcode::Unicode::PEDANTIC
>            When set to non-zero, x-to-unicode conversion becomes
>            pedantic.  That is, '\' (chr(0x5c)) is converted to
>            zenkaku backslash and '~' (chr(0x7e)) to JIS-x0212
>            tilde.
>
>            By Default, Jcode::Unicode leaves ascii ([0x00-0x7f])
>            as it is.
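
   To make the two behaviors in the excerpt concrete, here is a rough
sketch in Python (not Jcode's actual code).  The function name and the
table are hypothetical, and the code point used for the tilde is an
assumption: the excerpt only says "JIS-x0212 tilde", so U+FF5E is a
stand-in here.  "Zenkaku backslash" is taken to be U+FF3C.

```python
# Hypothetical illustration of an ASCII-preserving vs. "pedantic" mapping.
# This is NOT how Jcode::Unicode is implemented; it only mirrors the
# behavior described in the documentation excerpt.

FULLWIDTH_BACKSLASH = "\uFF3C"  # FULLWIDTH REVERSE SOLIDUS ("zenkaku backslash")
FULLWIDTH_TILDE = "\uFF5E"      # assumption: stand-in for the "JIS-x0212 tilde"

# Only these two ASCII bytes are treated differently in pedantic mode.
PEDANTIC_OVERRIDES = {
    0x5C: FULLWIDTH_BACKSLASH,
    0x7E: FULLWIDTH_TILDE,
}


def ascii_range_to_unicode(data: bytes, pedantic: bool = False) -> str:
    """Map bytes in 0x00-0x7f to Unicode under either policy."""
    out = []
    for b in data:
        if b > 0x7F:
            raise ValueError("this sketch only handles bytes 0x00-0x7f")
        if pedantic and b in PEDANTIC_OVERRIDES:
            out.append(PEDANTIC_OVERRIDES[b])
        else:
            # Default policy: leave ASCII as it is.
            out.append(chr(b))
    return "".join(out)
```

   With pedantic=False the input comes back unchanged; with
pedantic=True only '\' and '~' are replaced.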


> Linux iconv will not take ICU's UTF-8.
> ICU's uconv will read the iconv output but does produce same as original
> table.euc.

   As far as I can see, Linux iconv is ASCII-preserving while ICU's is
Unicode-strict.
   From Perl's point of view, ASCII-preserving should be the default.
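
   One rough way to check whether a given decoder is ASCII-preserving in
this sense is to feed it every byte from 0x00 to 0x7f and compare the
result with the same code point.  A minimal sketch using Python's
built-in codecs follows; the helper name is hypothetical, and this says
nothing about what iconv or uconv do internally.

```python
def is_ascii_preserving(encoding: str) -> bool:
    """Return True if every byte 0x00-0x7f decodes to the same code point."""
    for b in range(0x80):
        try:
            decoded = bytes([b]).decode(encoding)
        except UnicodeDecodeError:
            # A lone byte that cannot be decoded is not preserved.
            return False
        if decoded != chr(b):
            return False
    return True
```

   For example, is_ascii_preserving("euc_jp") runs the check against
Python's own EUC-JP codec.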
   FYI, I reported this brain-dead mapping problem to the Unicode
Consortium but never got an answer.  Well, they are not a public society,
in that they charge for the membership you need to have any say.  One of
the reasons so many Japanese love to hate Unicode...

> Our current euc-jp.ucm is compatible with Linux iconv.

   Right choice.

Dan the Man with So Many Charsets to Deal With
