On Sat, 30 Mar 2002, Gaspar Sinai wrote:

> I noticed that at ftp.unicode.org /Public/MAPPINGS/EASTASIA
> has been moved to OBSOLETE directory. README.TXT reads:
> 
>   The entire former contents of this directory are obsolete
>   and have been moved to the OBSOLETE directory.  The latest
>   information may be found in the Unihan.txt file in the latest
>   Unicode Character Database.
>   August 1, 2001.
> 
> I looked at Unihan.txt file but I found no way to extract
> GB2312.TXT JIS0208.TXT JIS0212.TXT KSC5601.TXT (KSX1001.TXT?)
> OLD5601.TXT and JIS0201.TXT files.

  KSC5601.TXT in OBSOLETE/EASTASIA is NOT the mapping
between Unicode and KS C 5601-1987 but the mapping between MS CP949 and
Unicode (sans the US-ASCII portion).  OLD5601.TXT is the mapping between
KS C 5601-1987 (the same character set as KS C 5601-1992 and
KS X 1001:1997) and Unicode 2.0, and so is KSX1001.TXT.

> For instance:
> JIS0201.TXT:
> 0xB1    0xFF71  # HALFWIDTH KATAKANA LETTER A
> 
> cd Public/UNIDATA
> grep -i FF71 *.* | grep -i B1
> 
> proves that neither Unihan.txt nor any of the other UNIDATA
> files can be used to generate JIS0201.TXT.
> 
> The question is: What is the best source for these maps?
> Is there a place where they are centrally maintained?

  You can extract two different mappings between EUC-KR
and Unicode from CP949.TXT (in VENDORS/MICSFT/) and KOREAN.TXT
(in VENDORS/APPLE).  Just filter out the non-EUC portion and keep EUC
codepoints only (that is, 0x00-0x7E for single-byte characters and
[0xA1-0xFE][0xA1-0xFE] for double-byte characters). If you want
the mapping between KS X 1001 and Unicode, you can subtract 0x8080 from
the codepoints of the two-byte characters in EUC-KR.  I've put them up
at

   http://jshin.net/faq/KSX1001.TXT.gz  (extracted from CP949.TXT)
   http://jshin.net/faq/JOHAB.TXT.gz    (for Johab)
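
For anyone who wants to redo the extraction themselves, here is a minimal
sketch of the filtering described above.  It assumes the usual layout of the
tables on ftp.unicode.org (hex code, whitespace, hex Unicode value, then a
'#' comment) and was written with CP949.TXT in mind; KOREAN.TXT may need
extra handling.  The function names are made up for this example and are
not the script used to produce the files above.

```python
def parse_mapping(lines):
    """Yield (legacy_code, unicode_value) pairs from a mapping table."""
    for line in lines:
        fields = line.split("#", 1)[0].split()
        if len(fields) < 2:
            # Comment, blank line, or a lead-byte entry with no Unicode value.
            continue
        yield int(fields[0], 16), int(fields[1], 16)


def is_euc_kr(code):
    """True if code lies in the EUC portion, using the ranges given above."""
    if code <= 0x7E:
        return True  # single-byte range 0x00-0x7E
    lead, trail = code >> 8, code & 0xFF
    return 0xA1 <= lead <= 0xFE and 0xA1 <= trail <= 0xFE


def extract_euc_kr(lines):
    """Keep only the EUC-KR entries of a CP949-style table."""
    return [(c, u) for c, u in parse_mapping(lines) if is_euc_kr(c)]


def to_ksx1001(pairs):
    """Subtract 0x8080 from two-byte codes; drop single-byte entries."""
    return [(c - 0x8080, u) for c, u in pairs if c > 0xFF]
```

Feeding the lines of CP949.TXT to extract_euc_kr() gives the EUC-KR table,
and passing that result to to_ksx1001() gives the KS X 1001 one.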


 The differences between the two mappings are well explained in Apple's
Korean mapping table, KOREAN.TXT
(ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/APPLE/KOREAN.TXT).
Another difference is that Apple's Korean mapping doesn't have the two new
characters added to KS X 1001 in December 1998.  They're EURO SIGN
(U+20AC) at row 2 column 70 (0xA2E6 in EUC-KR and 0x2266 in ISO-2022-KR)
and REGISTERED SIGN (U+00AE) at row 2 column 71 (0xA2E7 in EUC-KR and
0x2267 in ISO-2022-KR).  Glibc and libiconv have already added them.
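
The row/column numbers and the byte values above are related by simple
arithmetic: add 0x20 to the row and to the column to get the two-byte
ISO-2022-KR code, then add 0x8080 to get EUC-KR.  A small sketch (the
function name is only for illustration):

```python
def ksx1001_codes(row, col):
    """Return (ISO-2022-KR code, EUC-KR code) for a KS X 1001 row/column."""
    if not (1 <= row <= 94 and 1 <= col <= 94):
        raise ValueError("row and column must be in 1..94")
    gl = ((row + 0x20) << 8) | (col + 0x20)  # 7-bit form used in ISO-2022-KR
    return gl, gl + 0x8080                   # 8-bit form used in EUC-KR


iso, euc = ksx1001_codes(2, 70)  # EURO SIGN
print(hex(iso), hex(euc))        # 0x2266 0xa2e6
```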


   Jungshik Shin


--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
