Dan Kogai passed on this question:

> On Wednesday, March 27, 2002, at 08:06 , Anton Tagunov wrote:
> > Hello, Dan!
> >
> > BTW, is the guy speaking
> >
> > http://www.debian.or.jp/~kubota/unicode-symbols.html.en
> >
> > right or not? His article is dated september..
> > He is speaking about lack of official tables, that the tables
> > have been withdrawn and made obsolete but without any replacement.
> > Is that true?
>
and answered it:

> Correct.
>
> http://www.unicode.org/Public/MAPPINGS/EASTASIA/Readme.txt
> > The entire former contents of this directory are obsolete and have been
> > moved to the OBSOLETE directory. The latest information may be found
> > in the Unihan.txt file in the latest Unicode Character Database.
> > August 1, 2001.
>
> But the Unihan database only contains JIS X 0208 and 0212, not JIS X
> 0201 because 0201 only maps ASCII and Halfwidth Katakana.
>
> I did also ask if the issue Kubota-san has raise has ever been worked on
> this at [EMAIL PROTECTED] but I got no definitive answer.

So here is your definitive answer.

The mapping issues that Kubota-san raises in that document are real and complicated, having to do with problems of roundtripping through various East Asian character encodings, differences in vendor mappings, and differences in interpretation of character widths.

The problems of specification of character widths *are* clearly within the scope of UAX #11, East Asian Width, and some of Kubota-san's issues have been dealt with there in recent revisions of UAX #11 and its associated data table, EastAsianWidth.txt for Unicode 3.2.

However, the issue of non-Han character mappings between Unicode and legacy East Asian character sets has been the subject of some misunderstanding. Contrary to popular opinion, the Unicode Consortium has *never* published authoritative mapping tables for any of the East Asian legacy standards per se.

The Windows and Macintosh East Asian mapping tables are provided by Microsoft and Apple, respectively, and are vouched for by those companies as representing their vendor mappings.

The tables for the East Asian *standards*, on the other hand, such as JIS0201.TXT, JIS0208.TXT, JIS0212.TXT, KSC5601.TXT, BIG5.TXT, and CNS11643.TXT, were only ever provided as tentative, informational-only tables. No claims were made about those tables being authoritative determinations by the Unicode Consortium as to what the mappings *should* be.
On the contrary, the tables had very tentative wording, indeed; for example:

    # This table contains the data the Unicode Consortium has on how
    # JIS X 0208 (1983) characters map into Unicode.
    # The kanji mappings are a normative part of ISO/IEC 10646. The
    # non-kanji mappings are provisional, pending definition of
    # official mappings by Japanese standards bodies

However, merely having the tables up on the website led most people to ignore the cautions and interpret them as authoritative tables provided by the Unicode Consortium, anyway. That, in turn, has led over the years to various and repeated reports of "bugs" in the tables -- some presented rather indignantly. And because of differences in implementations, the bug reports sometimes come in equal but opposite pairs.

However desirable it may be for somebody to "provide the answer" for everyone about East Asian character set non-Han mappings, the Unicode Technical Committee has not yet determined that it is part of its charter to "standardize" mapping tables, particularly for East Asian non-Han characters, nor is it self-evident how it would go about doing so, given the de facto differences in implementations and preexisting (pre-Unicode) complications in interpretation of some characters in the East Asian standards.

The uncertainty within the Unicode Technical Committee as to exactly who owns the mapping problem -- the UTC itself, the East Asian standards committees, or the vendors -- led to the decision last year to move all the East Asian standards mapping tables explicitly to the OBSOLETE directory under the MAPPINGS section of the online data on the website. This leaves the same, unchanged tables available to people if they want, but makes it more obvious that the UTC is not standing behind those tables as representing any authoritative opinion.

However, this action itself has led to further misinterpretations. Kubota-san says, "The cross mapping tables for east asian encodings and character sets ...
became obsolete [in Unicode 3.1.1]." The fact is that the UTC has no official statuses for mapping tables, and it is meaningless to say that the UTC "obsoleted" some particular mapping table, because none of them are standardized, authoritative, obsoleted, deprecated, superseded, or have any other official status. They are all simply provided for information -- and because of the problematical issues in the East Asian standards mapping tables, they were pushed into the already existing "OBSOLETE" directory to make it more obvious that the UTC wasn't claiming they were authoritative or up-to-date.

Kubota-san also says, "Now we don't have any authorized mapping tables for east asian encodings." Well, they weren't "authorized" (in the sense of "authoritative") in the first place -- they were simply individual contributions that were posted for information. Their posting was, of course, authorized, but the content was not authoritative.

Kubota-san also goes on to say, "Unicode is a standard. Not supplying an authorized unified reference mapping table seems to show that Unicode abandons the responsibility as a standard." Again, however desirable it might be to have the Unicode Consortium provide the definitive answer to all East Asian character set interoperability problems, and however much we might like there to be a single, simple answer, it isn't obvious that that is going to happen.

The responsibility of the Unicode Consortium as a standards organization is to maintain and develop the *Unicode Standard* -- which it does. The Unicode Consortium is not responsible for the maintenance and interpretation of JIS standards or KS X standards or GB standards or CNS standards and the like.

What people may be missing here is that there is no IRG equivalent for the non-Han unification problem in East Asian standards.
The IRG provides normative mapping tables for *ideographic* characters as part of the official work of WG2 to encode unified Han characters in 10646 -- and those tables are then published both by ISO and the Unicode Consortium as normative parts of their standards. But there is no "nonIRG" to do the same work for the non-ideographic characters, with official participation by the relevant national standards committees representing their standards. Until such time as a "nonIRG" is put together, it isn't clear how anyone is going to assemble definitive, normative mapping tables for those various legacy standards.

In the meantime, it may be possible to do a better job of *explaining* the mapping problems highlighted by Kubota-san. And in fact there have been tentative proposals for someone to write a Unicode Technical Report about East Asian non-ideographic character mapping. That could enable the provision of mapping tables that would have some context, validity, and an explanation of alternative mappings and mapping problems. But until someone steps forward to truly *own* the problem and author such a Technical Report, it is unlikely that the issue will move forward.

Ken Whistler,
Technical Director, Unicode, Inc.

[I don't usually sign myself thus in contributions to this list, but you wanted a definitive answer, so there it is.]