David Hopwood wrote:

> > First of all, as Mark pointed out, there are two quite distinct
> > usages of the term in the standard currently.
> >
> > 1. (decomposition) compatibility character
> >
> >    That is what D21 is about, and is derived on the basis of
> >    the presence or absence of compatibility decompositions.
> >
> > 2. (legacy) compatibility character
> >
> >    These are characters that were included in the standard for
> >    compatibility with other standards, for crossmapping, or
> >    for other legacy interoperability reasons. Sometimes they
> >    have compatibility mappings, sometimes they have canonical
> >    mappings (see, e.g., all the CJK compatibility ideographs),
> >    and sometimes they have no mappings to other Unicode characters.
> >
> > The text of the standard is being rewritten to make the distinction
> > between these two uses of the term clear.
>
> Is there any formal definition of a legacy compatibility character
> in terms of the Unicode data files, or is it only possible to give a
> list? (If the latter, perhaps it would be useful to add a "Legacy"
> property to PropList-n.n.n.txt.)

There is no formal definition. It might be nice to have a list, but
we'd have to spend a year arguing over its contents. One man's
"compatibility character" is another man's "gotta have it required
character". By some reckoning, all of the precomposed Latin letters
from 8859-1, -2, -3, ... are compatibility characters, for example.

The problem is that you cannot really divorce the problem of
designating characters as "compatibility characters" (which is a
polite way of denigrating them as unwelcome intruders that we have
to put up with) from the problem of trying to scope out "Cleanicode"
-- what Unicode would have been like if all the scripts and symbols
could have been encoded without having to take legacy character
encoding mistakes, obsolete implementation practices, encoding
committee compromises, and the like into account. And while numerous
people have on occasion threatened to go off and define Cleanicode,
to date no one has, to my knowledge.

It would take a braver man than I to pull out the marker pen and
divide all 94,140 Unicode characters into the good ones and the bad
ones, and then defend that line against the thousands of people who
would disagree about where the line was drawn.

Frankly, with some exceptions which the UTC has agreed to call out
as particularly egregious, we are probably all better off just
living with the ambiguity -- believing that the other guys' bad
characters are just there for compatibility, but that my own good
characters are full-fledged citizens with no compatibility brand on
their flanks. It's more of a standards politics issue than an
implementation issue.

> > In my opinion, rather than just "fixing" the D1 definition
> > of "compatibility character" to match one or the other
> > of these, we need a further clarification of the distinctions,
> > and if necessary new terminology to make it easier to know
> > which of these sets we are talking about.
>
> I'd suggest keeping "compatibility character" for NFKD(c) != NFD(c),
> and call the other definition just "legacy character". After all,
> legacy characters don't have any formal relation to compatibility
> equivalence.

True enough, but then ASCII A..Z are also legacy characters. And the
terminology verges on meaningless.

Also, the Unicode Standard now has its own 11 year legacy of claiming
that various characters "are encoded for compatibility with XYZ", and
calling those characters "compatibility characters". Trying to fix
that now to some new terminology would probably introduce as much
miscomprehension as it would address.

We (the editors) have reluctantly concluded that owning up to and
clarifying the polysemy of "compatibility character" in the standard
is likely the course of least unintended consequences.

--Ken
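
For the decomposition sense only, the test suggested in the quoted
text (NFKD(c) != NFD(c)) can be checked mechanically. What follows is
a minimal sketch, assuming Python's standard-library unicodedata
module and whatever version of the Unicode data it happens to ship
with; the function name is_decomposition_compat is made up for
illustration and does not come from the standard. No comparable test
exists for the legacy sense, since that set has no formal definition.

```python
import unicodedata


def is_decomposition_compat(ch: str) -> bool:
    """Return True if NFKD(ch) != NFD(ch).

    Hypothetical helper: it implements only the NFKD-vs-NFD comparison
    suggested in the thread, not any definition from the standard.
    """
    nfd = unicodedata.normalize("NFD", ch)
    nfkd = unicodedata.normalize("NFKD", ch)
    return nfkd != nfd


if __name__ == "__main__":
    samples = [
        "\uFB01",  # LATIN SMALL LIGATURE FI: compatibility mapping
        "\u00BD",  # VULGAR FRACTION ONE HALF: compatibility mapping
        "\u00E9",  # LATIN SMALL LETTER E WITH ACUTE: canonical mapping only
        "\uF900",  # CJK COMPATIBILITY IDEOGRAPH-F900: canonical mapping only
        "A",       # LATIN CAPITAL LETTER A: no mapping at all
    ]
    for ch in samples:
        name = unicodedata.name(ch)
        print(f"U+{ord(ch):04X} {name}: {is_decomposition_compat(ch)}")
```

U+F900 comes out False under this test even though it sits in a block
named "CJK Compatibility Ideographs": its mapping is canonical, so it
is a compatibility character only in the legacy sense. Precomposed
U+00E9 and plain "A" also come out False, whatever one thinks of
their legacy status.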

