Simon Josefsson said:

> > sigh.... all 5 are beyond-BMP characters added recently.
> > if we could go back in time, we could have implemented a policy of not
> > accepting characters as stable before 2 unicode versions had gone
> > by....
> >
> > proofreading takes time.
>
> Even two Unicode releases doesn't guarantee that the tables are
> correct.
Of course. But in this case, because of the sensitivity of this problem, 5 independent audits of the Plane 2 CJK compatibility characters by CJK experts have converged on the same answer. There are 5 mistakes (4 clerical and one visual) in the current mappings for that set of recently added 542 supplementary characters on Plane 2.

[There are also other known issues in the mappings, whereby one Han variant or another might be a "better" mapping, but there is also consensus among the experts that none of those issues rise to the level of blatant *mistakes* that must be corrected in the current tables -- which is why none of those is being balloted or will be in the future.]

The U.S. national body and the UTC are both also pushing back hard on the proposed addition of another 122 CJK compatibility characters, because the CJK mapping experts have discovered errors in *that* table as well. In this case, having learned our lesson, the UTC is trying to be proactive and ensure that all errors are removed from the table *before* such an addition is standardized.

> The only proper solution I can see is to stop modifying
> published decomposition tables. When mistakes are discovered, new
> character codes with proper decompositions should be added and the old
> character codes declared obsolete

-- which is option B in the vote. This will lead to other interoperability problems. The 542 supplementary characters in question (and all of the ones involving the errors) are CNS compatibility characters. They are there to provide round-trip mappings to the CNS 11643 standard. If you "obsolete" 5 code points and then add 5 new ones, then it is inevitable that CNS mapping tables will get updated to use the new code points instead of the old ones (and there will be some inconsistency in the mappings, because of the duplications, during this transition) -- because the old code points get normalized away to nonsense characters. This will undoubtedly lead to further problems, including for IDNA string matching, as one of the duplicated pair normalizes one way, and the other -- apparently identical -- normalizes another way. (See the first sketch in the P.S. below.)

And you can't escape the problem by just adding the 5 obsolete code points to the stringprep prohibited list, because that, *too*, would have destabilized your specification: a string that was valid before you did that would be invalid after you did so. (See the second sketch in the P.S. below.)

> but unfortunately neither IDN nor IETF has any voting powers (which
> suggest a methodological problem).

Why would IDN have voting powers here? You don't expect the UTC or SC2/WG2 to have voting powers in an IDN working group, do you?

As for the IETF, the UTC and the IETF have a *liaison* relationship. The UTC immediately informed the IETF liaison about this ballot, because it knew this was an important issue that IETF participants are concerned about. That is why this discussion has migrated over to interested parties on the IDN list who have worked on IDNA and stringprep.

But the buck has to stop somewhere. Ultimately the UTC and WG2 are responsible for the CJK compatibility character mapping tables. So those committees have to take the relevant votes, and, if they end up standardizing errors, also have to take the relevant knocks when they go to fix the errors.

To influence the actual *voting* on this (or other issues), one works through the UTC voting member representatives in the UTC case, or one works through the national bodies participating in SC2 in the SC2/WG2 case.

--Ken
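P.S. To make the first of the two stability arguments above concrete, here is a rough Python sketch of the duplicated-pair problem. Every code point and mapping target in it is an invented placeholder (the "compatibility" characters are deliberately taken from a private-use plane), not any of the actual characters under ballot; the only thing it is meant to show is that two visually identical code points carrying different singleton canonical mappings normalize to different strings, so an IDNA-style comparison of the normalized forms treats them as different names.

```python
# Toy model of singleton canonical mappings, standing in for the way NFC/NFD
# map CJK compatibility ideographs away to their unified counterparts.
# All code points and targets below are invented placeholders for illustration.
CANONICAL_MAPPING = {
    "\U000F0001": "\u4E00",  # "old" code point, kept with its erroneous target
    "\U000F0002": "\u4E8C",  # "new" duplicate, added with the corrected target
}

def toy_normalize(s: str) -> str:
    """Replace each mapped code point by its target, as normalization would."""
    return "".join(CANONICAL_MAPPING.get(ch, ch) for ch in s)

# Two labels that differ only in which of the (visually identical) duplicates
# they contain end up as different strings after normalization, so a
# nameprep/IDNA-style match on the normalized forms fails.
label_old = "example" + "\U000F0001"
label_new = "example" + "\U000F0002"
print(toy_normalize(label_old) == toy_normalize(label_new))  # False
```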
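And a similarly rough sketch of the second point, about the stringprep prohibited list. The table contents are again invented; the point is only that adding code points to a prohibited table flips the validity of a string that was valid under the previous profile, which is itself a destabilizing change.

```python
# Minimal model of a stringprep-style prohibition check. The "before" and
# "after" tables are hypothetical; only the validity flip matters.
def is_valid_label(label: str, prohibited: set) -> bool:
    return not any(ord(ch) in prohibited for ch in label)

prohibited_before = set()        # the "obsoleted" code points not yet listed
prohibited_after = {0x0F0001}    # same placeholder code point as above, now prohibited

label = "example" + "\U000F0001"
print(is_valid_label(label, prohibited_before))  # True: valid under the old profile
print(is_valid_label(label, prohibited_after))   # False: the same string is now invalid
```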
