> > 282 MES-2 is specified by the following ranges of code positions as
> > indicated for each row...

Philippe Verdy asked:

> As most of these characters are canonically decomposable, shouldn't this
> list include also the decomposed characters?
>
> Why is row 03 so restricted? Shouldn't it include those accents and
> diacritics that are used by other characters once canonically
> decomposed? Or does it imply that MES-2 is only supposed to use
> strings in NFC form?

MES-2, like all the rest of the Multilingual European Subsets, is a
CEN construct. See the CEN Workshop Agreement, CWA 13873:2000, posted
at Michael Everson's site:

http://www.evertype.com/standards/iso10646/pdf/cwa13873.pdf

Among other things, that CWA states:

"This CWA does *not* specify any encoding of the European Subsets."

so conceptually it is more like a repertoire listing.

MES-2 is formally listed in 10646 as one of the normative subsets
there, but since 10646 has no concepts of decomposition,
normalization, or equivalence, the fact that MES-2 contains
precomposed characters but not their decompositions or the relevant
combining accents is formally irrelevant.

The Unicode Standard does not make subsets a normative construct
for that standard and doesn't even mention MES-2. Conformance to
10646 doesn't require you to make use of its subsets, but if anyone
is worried about the articulation of the standards, the Unicode
Standard itself formally consists of Subset 305 of 10646:2003,
namely the "UNICODE 4.0" subset -- the subset which contains *all*
of the encoded characters of 10646:2003.

Think of the Multilingual European Subsets as a kind of way for
people in Europe associated with standards organizations and
governments to try to communicate with software vendors regarding
which "user characters" they want to ensure are supported by their
software.
The CWA 13873 contains some questionable presuppositions about how
software vendors are actually proceeding to roll out their Unicode
support, but the intent of the CWA is clear:

"It is estimated that implementing the full character set of the UCS
may be costly in the first stages of UCS use, and that many
manufacturers will implement in subset-stages. To ensure that a
common subset usable to the vast majority of European users be
available for a reasonable price, and as a guide to manufacturers,
it will be helpful to specify, to users and procurers of systems,
European subsets of the UCS encompassing the characters for use in
European languages as well as other frequently used and specialist
characters."

> Also, is this list under full closure with existing character properties, like
> NFKD decompositions, and case mappings?

MES-2 is clearly *not* closed under NFD, NFKD, or NFKC normalizations.

Although less obvious, it is also not closed under NFC normalization.
For example, it includes the angle brackets U+2329, U+232A, but
not their canonical equivalents, U+3008, U+3009.

There are also some characters outside the MES-2 repertoire where
NFC(x) *is* in the MES-2 repertoire. Singleton canonical equivalences
like U+212B ANGSTROM SIGN come to mind, for example.

I haven't checked on case mappings and case foldings, but would
not be too surprised to find an anomaly or two there, as well.

MES-2 was not designed by the UTC, nor did it take any of these
considerations into account. It is not really an appropriate
construct for the Unicode Standard. A more meaningful way to think
of it is: if you want to sell software in Europe, you better be
able to input and display all the characters we Europeans have
in this list.

--Ken
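
The closure points above can be checked mechanically. What follows is
only a rough sketch using Python's standard unicodedata module. The
three-character SAMPLE set and the helper names escapes() and
code_points() are made up for illustration; SAMPLE is *not* the MES-2
repertoire, which would have to be loaded from the CWA or from 10646.

```python
import unicodedata

# Hypothetical three-character sample, NOT the real MES-2 list:
# the two angle brackets plus A WITH RING ABOVE.
SAMPLE = {"\u2329", "\u232A", "\u00C5"}

def escapes(repertoire, form):
    """Map each member whose normalized form leaves the repertoire
    to that normalized form. An empty result means the repertoire
    is closed under the given normalization form."""
    out = {}
    for ch in sorted(repertoire):
        norm = unicodedata.normalize(form, ch)
        if any(c not in repertoire for c in norm):
            out[ch] = norm
    return out

def code_points(s):
    """Render a string as space-separated U+XXXX notation."""
    return " ".join(f"U+{ord(c):04X}" for c in s)

# Members of the sample that normalization maps outside the sample.
for form in ("NFC", "NFD"):
    for ch, norm in escapes(SAMPLE, form).items():
        print(form, code_points(ch), "->", code_points(norm))

# A character outside the sample whose NFC form is inside it.
print("NFC U+212B ->", code_points(unicodedata.normalize("NFC", "\u212B")))
```

Run against the full repertoire, a check like this would list every
character that a given normalization form maps out of the set, which
is the sense of "not closed" used above.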

