John,

> It probably isn't worth your time, my time, or especially that
> of the WG to go through this in detail. The bottom line is that
> we would be stuck with Unicode if it were an act of complete
> beauty that had resolved all tradeoffs to everyone's
> satisfaction for all purposes, and we would be stuck with it if
> it got many of those tradeoffs wrong from the standpoint of
> IDN/DNS use (regardless of whether they were made optimally from
> other perspectives).
As with Mark, I am in agreement with this assessment.

> That said, three comments that seem to need making:
> ...

But in turn, I find that I still need to respond to some of your comments.

> There are other seeming anachronisms in your version of the
> story (e.g., the original design base for 10646 was purely as a
> 32-bit character set, so a criticism on the basis of what fit
> into "the BMP" is a little strange -- while there were some
> early attempts to push 16-bit subsets (mostly from printer
> vendors, if I recall), unless my memory is failing severely, the
> concept of a "BMP" originated with the Unicode merger.

This is the kind of reconstructed reality that Mark was objecting to in his initial post. Memory is a tricky beast.

DP 10646 (which preceded DP2 10646 and DIS-1 10646, let alone the result of the Unicode merger, which was DIS-2 10646, dated 26 December 1991) *already* had the concept *and* the exact term "Basic Multilingual Plane". And incidentally, it was Group 032, Plane 032, to be exact. SPACE was encoded as G=032 P=032 R=032 C=032, *not* U-00000020 as we have so conveniently gotten used to in 10646 now.

10646 was *never* "purely...a 32-bit character set". It had an architecture, from the start, which consisted of cells, rows, planes, and groups, that together constituted a 32-bit encoding space, but it was always a multiple-octet character set. DP 10646 had 7 forms of use: 1, 1A, 2, 2A, 3A, 4 and 5. Those were single-byte, double-byte, triple-byte, quadruple-byte, and a (limited) form of mixed-byte, respectively, with the "A" forms also allowing use of a SINGLE GRAPHIC CHARACTER INTRODUCER byte. With the exception of form of use 4, these forms were often referred to at the time as "compaction methods".

U.S. ballot comments on DP 10646 (dated 28 April 1989, Doc. X3L2/89-76) requested that this be reorganized into 5 forms of use: 1, 2, 3, 4, 5 (as above), with two levels corresponding to use or non-use of the SGCI.
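To make the structural point concrete, here is a small illustrative sketch (mine, not from any draft's text) of the arithmetic involved: the Group/Plane/Row/Cell octets of the early drafts together form one 32-bit code position, whereas the post-merger drafts rearranged everything down to the origin point, giving the flat UCS-4 values we use today. The function name is an assumption for illustration; only the G=032 P=032 R=032 C=032 encoding of SPACE is from the text above.

```python
def gprc_to_uint32(g: int, p: int, r: int, c: int) -> int:
    """Pack four octets (group, plane, row, cell) into a single
    32-bit code position, most significant octet first."""
    return (g << 24) | (p << 16) | (r << 8) | c

# SPACE in DP 10646: G=032 P=032 R=032 C=032 (decimal 32 == 0x20)
dp_space = gprc_to_uint32(32, 32, 32, 32)
print(hex(dp_space))   # 0x20202020

# SPACE after the DIS-2 rearrangement down to the origin point:
ucs4_space = 0x00000020
print(hex(ucs4_space))  # 0x20
```

The same four octets that once spelled out "plane 032, row 032" now read as a plain integer 0x20 in plane 0 of group 0.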
(Needless to say, that was before the Unicode advocates had much influence on the wording of U.S. ballot comments on 10646.)

All that was drastically simplified later in the DIS-2 draft (the product of the Unicode/10646 merger talks), which dropped the 1-, 3-, and mixed-byte forms of use, the SGCI (as well as the HOP, which had allowed in-stream announcements of the forms of use). DIS-2 had simply two forms of use: UCS-2 and UCS-4 -- and that continues through to today in ISO/IEC 10646-1:2000. UCS-2 was, of course, the (then) Unicode-compatible way of using 10646. DIS-2 also dropped the C0/C1 restrictions on octets and rearranged everything down to the origin point, so that UCS-4 could be used as a wchar_t implementation, among other things.

As for the push for "16-bit subsets", the following note may be a useful counter-tonic. This is from the official Chinese national body comments, dated May 29, 1991, in their disapproval of DIS-1 10646:

"The current DIS 10646 allocates Chinese Hanzi, Japanese Kanji and Korean Hanja into different planes, while the three zones of B.M.P., I-01, I-10 and I-11, have not been defined yet. There is not even any explaination [sic] found in the DIS text for the vacancy. Therefore, the plane 032 can not be regarded as a real Basic Multilingual Plane because of the absence of ideographs. Consequently, we would like to request to include Unified CJK Ideographs into BMP based on the structure proposed in the above item 1. [Item 1 refers to removal of the C0/C1 restriction. --kenw] In order to avoid unnecessary duplicate work, China is willing and pleased to contribute the document of the repertoire of HCS [Han Character Set --kenw] as the basis for discussion."

Earlier, in the official Chinese national body comments on DP2 10646, dated February 16, 1990, China commented in their negative ballot on that draft:

"In the second DP, the sparate [sic] assignment of C/J/K characters not only wastes the valuable B.M.P.
and excludes the unsimplified Chinese characters which have important practical value, but also directly violates the principle of Character encoding by script rather than language/country which is laid down in DP10646. And the situation of one character with more than one codes [sic] resulting from this will cause serious frustration to the future multi-lingual applications. Therefore, The Chinese National Body does not approve the arrangement of Han Characters as specified in the second DP 10646."

I don't think the Chinese national standards body could realistically be considered a "printer vendor".

These comments speak both to the then-general desire to have the Basic Multilingual Plane be a usable international subset of the entire construct of 10646, and to the Chinese disaffection with language/country-specific encoding of Han characters and their requirement for a meaningful Han unification in 10646.

> (iii) The real points of my raising those historical issues were
> the one you seem to have missed, so let me assume I wasn't clear
> and say it explicitly. As I hope most of the participants in
> IDN have long ago figured out, this business of trying to create
> a single "UCS" is one that involves many complex tradeoffs and
> for which there are often no easy answers.

On this we are in agreement. The problem is in misrepresentations of Unicode and/or 10646 history in service of making the point.

> To give just a few examples,...
>
> * Keeping scripts together and preserving widely-used
>   earlier CCSs as blocks is A Good Thing. But having
>   exactly one code point associated with a given glyph/
>   letter shape is also A Good Thing. One can't have both.

For a limited subset of Latin/Greek/Cyrillic, having one code point associated with a given glyph has (by some) been considered A Good Thing. But in the general context of a Universal Character Set, it is clearly A Bad Thing. That approach leads to total botches of Arabic or Indic processing, for example.
And the tradeoff here has little to do with preserving the structure of earlier CCSs as blocks.

> * Han unification provides many benefits. Han
>   unification also causes some problems (one of which, in
>   the present instance, is that one appears to need
>   metadata to map between TC and SC without doing nasty
>   things to Kanji). One cannot both do unification and
>   not do unification.

Han unification has nothing to do with the TC/SC problem. There are tradeoffs, but they aren't *this* tradeoff. Han unification neither created nor eliminated the TC/SC distinctions.

Regards,

--Ken
