Re: EUC-JP <-> Unicode roundtrip compatibility

Markus Kuhn Wed, 11 Apr 2001 06:10:14 -0700
Tomohiro KUBOTA wrote on 2001-04-11 12:24 UTC:
> > As far as I can tell, the
> > relevant mapping and unihan tables on http://www.unicode.org/Public/ are
> > 100% bug-free by definition, as they were used to print the Han columns
> > in ISO 10646-1.
> 
> Saying about Unicode Consortium's conversion table, it is impossible
> to construct round-trip compatible EUC-JP <-> UCS conversion table.
> This is because
> 
>         http://www.unicode.org/Public/MAPPINGS/EASTASIA/JIS/JIS0208.TXT
> 
> has the following line:
> 
>         0x815F        0x2140  0x005C  # REVERSE SOLIDUS
> 
> This means that 0x2140 in JIS X 0208 (0xA1 0xC0 in EUC-JP) is
> mapped into U+005C.
> 
> Note that EUC-JP is a CES (Character Encoding Scheme) whose
> CCS (Coded Character Sets) are ASCII and JIS X 0208 (optionally
> JIS X 0201 Kana and JIS X 0212).
> 
> Which should U+005C be converted into in EUC-JP, 0x5C or 0xA1 0xC0?

Ah, I see. While the Unicode mapping tables provide round-trip
compatibility to both JIS X 0208 and ASCII individually, they do not
provide round-trip compatibility to an encoding such as EUC-JP that
distinguishes between every element of JIS X 0208 and ASCII.
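
To make the collision concrete, here is a minimal Python sketch. The two
one-entry dictionaries are hand-typed excerpts for illustration only (the
names ASCII_TO_UCS, JISX0208_TO_UCS and find_collisions are made up here,
not part of any library); a real check would load the full tables.

```python
# Toy excerpts of the two tables; only the entries relevant here.
# Keys are EUC-JP byte sequences, values are Unicode code points.
ASCII_TO_UCS = {b"\x5c": 0x005C}          # ASCII REVERSE SOLIDUS
JISX0208_TO_UCS = {b"\xa1\xc0": 0x005C}   # JIS0208.TXT: 0x2140 -> U+005C


def find_collisions(*tables):
    """Return {code point: sorted byte sequences} for every code point
    that is reachable from more than one EUC-JP byte sequence."""
    sources = {}
    for table in tables:
        for euc_bytes, ucs in table.items():
            sources.setdefault(ucs, []).append(euc_bytes)
    return {ucs: sorted(seqs) for ucs, seqs in sources.items() if len(seqs) > 1}


collisions = find_collisions(ASCII_TO_UCS, JISX0208_TO_UCS)
for ucs, seqs in collisions.items():
    print(f"U+{ucs:04X} <- {[s.hex() for s in seqs]}")
```

Any code point reported here has two possible EUC-JP encodings, so a
single reverse table cannot restore the original bytes.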

I think there is an easy and straightforward way out of this:

If (and only if) you map EUC-JP to Unicode, just replace the above line
in the JIS X 0208 mapping table with

        0x815F  0x2140  0xFF3C  # FULLWIDTH REVERSE SOLIDUS

so that EUC-JP 0xA1 0xC0 maps to U+FF3C FULLWIDTH REVERSE SOLIDUS.

This seems to be the most suitable Unicode character if you need
round-trip compatibility between Unicode and JIS X 0208 + ASCII.

I think that on POSIX systems a Unicode to EUC-JP conversion should most
definitely map U+005C to the ASCII byte 0x5C, because lots of software
assigns special semantics to this character (e.g., the C string escape
"\n"). It was my understanding that this is what glibc already does (and
that its regression test suite even enforces it).
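
As a rough sketch of the patched scheme, assuming a toy two-entry table
(the names EUC_TO_UCS, decode_euc_jp and encode_euc_jp are hypothetical,
and this is not how glibc implements it), the round trip would look like
this in Python:

```python
# Hypothetical two-entry excerpt; a real converter would load full tables.
EUC_TO_UCS = {
    b"\x5c": "\u005c",       # ASCII REVERSE SOLIDUS
    b"\xa1\xc0": "\uff3c",   # patched entry: FULLWIDTH REVERSE SOLIDUS
}
# The decode table is now one-to-one, so inverting it is well defined.
UCS_TO_EUC = {ucs: euc for euc, ucs in EUC_TO_UCS.items()}


def decode_euc_jp(data: bytes) -> str:
    out = []
    i = 0
    while i < len(data):
        # In this toy subset a byte >= 0x80 starts a two-byte JIS X 0208
        # sequence; the SS2/SS3 prefixes of full EUC-JP are ignored.
        n = 2 if data[i] >= 0x80 else 1
        out.append(EUC_TO_UCS[data[i:i + n]])
        i += n
    return "".join(out)


def encode_euc_jp(text: str) -> bytes:
    return b"".join(UCS_TO_EUC[ch] for ch in text)


sample = b"\x5c\xa1\xc0\x5c"
roundtrip = encode_euc_jp(decode_euc_jp(sample))
```

With this table U+005C always comes back as the single byte 0x5C, and
0xA1 0xC0 survives the trip through U+FF3C unchanged.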

If you do a Unicode to JIS X 0208 (not in the context of EUC-JP)
conversion, *both* U+005C *and* U+FF3C should be mapped onto JIS X 0208
code point 0x2140. This way you never lose anything.
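
For the standalone JIS X 0208 direction, a minimal sketch (again with a
made-up two-entry table, UCS_TO_JISX0208, and a hypothetical helper
to_jisx0208) only needs a many-to-one lookup:

```python
# Unicode -> JIS X 0208 excerpt: both backslash variants share one target.
UCS_TO_JISX0208 = {
    0x005C: 0x2140,   # REVERSE SOLIDUS
    0xFF3C: 0x2140,   # FULLWIDTH REVERSE SOLIDUS
}


def to_jisx0208(ch: str) -> int:
    """Return the JIS X 0208 code point for a single character."""
    return UCS_TO_JISX0208[ord(ch)]


codes = [to_jisx0208(ch) for ch in "\u005c\uff3c"]
```

The mapping is deliberately not invertible; that is acceptable here
because no ASCII plane competes for the same character.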

Is there anything wrong with this approach?

Is that any different from what iconv does at the moment, Bruno?

Markus

-- 
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org,  WWW: <http://www.cl.cam.ac.uk/~mgk25/>

-
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/lists/