On Sat, 30 Mar 2002, Gaspar Sinai wrote:

> I noticed that at ftp.unicode.org /Public/MAPPINGS/EASTASIA
> has been moved to OBSOLETE directory. README.TXT reads:
>
> The entire former contents of this directory are obsolete
> and have been moved to the OBSOLETE directory. The latest
> information may be found in the Unihan.txt file in the latest
> Unicode Character Database.
> August 1, 2001.
>
> I looked at Unihan.txt file but I found no way to extract
> GB2312.TXT JIS0208.TXT JIS0212.TXT KSC5601.TXT (KSX1001.TXT?)
> OLD5601.TXT and JIS0201.TXT files.

KSC5601.TXT in OBSOLETE/EASTASIA is NOT the mapping between Unicode
and KS C 5601-1987 but the mapping between MS CP949 and Unicode
(sans the US-ASCII portion). OLD5601.TXT is the mapping between
KS C 5601-1987 (KS C 5601-1992 and KS X 1001:1997) and Unicode 2.0.
So is KSX1001.TXT.

> For instance:
> JIS0201.TXT:
> 0xB1 0xFF71 # HALFWIDTH KATAKANA LETTER A
>
> cd Public/UNIDATA
> grep -i FF71 *.* | grep -i B1
>
> proves that neither Unihan.txt nor any of the other UNIDATA
> files can be used to generate JIS0201.TXT.
>
> The question is: What is the best source for these maps?
> Is there a place where they are centrally maintained?

You can extract two different mappings between EUC-KR and Unicode
from CP949.TXT (in VENDORS/MICSFT/) and KOREAN.TXT (in VENDORS/APPLE).
Just filter out the non-EUC portion and keep the EUC codepoints only
(that is, 0x00-0x7E for single-byte characters and
[0xA1-0xFE][0xA1-0xFE] for double-byte characters). If you want the
mapping between KS X 1001 and Unicode, you can subtract 0x8080 from
the codepoints of the two-byte characters in EUC-KR. I've put them
up at

  http://jshin.net/faq/KSX1001.TXT.gz (extracted from CP949.TXT)
  http://jshin.net/faq/JOHAB.TXT.gz (for Johab)

The differences between the two mappings are well explained in
Apple's Korean mapping table, KOREAN.TXT
(ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/APPLE/KOREAN.TXT).
Another difference is that Apple's Korean mapping doesn't have the
two new characters added to KS X 1001 in December 1998. They're
EURO SIGN (U+20AC) at row 2, column 70 (0xA2E6 in EUC-KR and 0x2266
in ISO-2022-KR) and REGISTERED SIGN (U+00AE) at row 2, column 71
(0xA2E7 in EUC-KR and 0x2267 in ISO-2022-KR). Glibc and libiconv
have already added them.

Jungshik Shin

--
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/linux-utf8/
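
For illustration only, the filtering and 0x8080 subtraction described
in the message could be sketched in Python roughly as follows. This is
not the script that produced the files linked above. It assumes each
mapping line has the form "0xCODE <whitespace> 0xUCS <whitespace>
#comment", as in CP949.TXT, and it simply skips any line that does not
parse that way. The function names (parse_line, is_euc_kr,
extract_euc_kr, to_ksx1001) are hypothetical.

```python
def parse_line(line):
    """Return (code, ucs) for a mapping line, or None if it has no mapping.

    Assumed line format: '0xCODE<ws>0xUCS<ws>#comment'.
    """
    fields = line.split("#", 1)[0].split()
    if len(fields) < 2:
        return None  # comment, blank line, or entry without a Unicode value
    try:
        return int(fields[0], 16), int(fields[1], 16)
    except ValueError:
        return None  # anything that is not two plain hex numbers is skipped


def is_euc_kr(code):
    """True if code falls in the EUC ranges quoted in the message."""
    if code <= 0x7E:
        return True  # single-byte range as stated: 0x00-0x7E
    lead, trail = code >> 8, code & 0xFF
    return 0xA1 <= lead <= 0xFE and 0xA1 <= trail <= 0xFE


def extract_euc_kr(lines):
    """Keep only the EUC-KR part of a CP949-style table (code -> Unicode)."""
    table = {}
    for line in lines:
        pair = parse_line(line)
        if pair is not None and is_euc_kr(pair[0]):
            table[pair[0]] = pair[1]
    return table


def to_ksx1001(euc_table):
    """Map double-byte EUC-KR codes to KS X 1001 by subtracting 0x8080."""
    return {code - 0x8080: ucs
            for code, ucs in euc_table.items() if code > 0xFF}
```

Given the EUC-KR entry 0xA2E6 for U+20AC, to_ksx1001 yields 0x2266,
which agrees with the ISO-2022-KR value quoted in the message. Passing
an open file object for CP949.TXT as the lines argument should work,
since the function only iterates over lines.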