Jim Jewett writes:

> > The PEP assumes NFC, but I haven't really understood why, unless that
> > is required for compatibility with other systems (in which case, it
> > should be made explicit).

"Martin v. Löwis" writes: > It's because UAX#31 tells us to use NFC, in section 5 > > "Generally if the programming language has case-sensitive identifiers, > then Normalization Form C is appropriate; whereas, if the programming > language has case-insensitive identifiers, then Normalization Form KC is > more appropriate." > > As Python has case-sensitive identifiers, NFC is appropriate. It seems to me that what UAX#31 is saying is "Distinguishing (or not) between 0035 DIGIT 3 and 2075 SUPERSCRIPT 3 should be equivalent to distinguishing (or not) between LATIN CAPITAL LETTER A and LATIN SMALL LETTER A." I don't know that I agree (or disagree) in principle. Here's what UAX#15 has to say: ---------------- Normalization Forms KC and KD must not be blindly applied to arbitrary text. Because they erase many formatting distinctions, they will prevent round-trip conversion to and from many legacy character sets, and unless supplanted by formatting markup, they may remove distinctions that are important to the semantics of the text. It is best to think of these Normalization Forms as being like uppercase or lowercase mappings: useful in certain contexts for identifying core meanings, but also performing modifications to the text that may not always be appropriate. They can be applied more freely to domains with restricted character sets, such as in Section 13, Programming Language Identifiers. ---------------- Note that Section 13 == UAX#31 (from which Martin is quoting). I don't see this section as being at all supportive of NFC over NFKC, though. Some detailed observations biased by my personal tastes: It seems to me that while I sometimes find it useful for FOO and foo to be different identifiers, I would almost always consider R3RS and R³RS to be the same identifier. The contrast is just too small to be useful. And I would never distinguish between a three-character fine (fi - n - e) and a four-character fine (f - i - n - e). 
I'd really love to see the printer's ligatures gone.

I'd love to get rid of full-width ASCII and halfwidth kana (via
compatibility decomposition). Native Japanese speakers often use them
interchangeably with the "proper" versions when correcting typos and
updating numbers in a series. Ugly, to say the least. I don't think that
native Japanese would care, as long as the decomposition is done
internally to Python.

A scan of the full table for Unicode Version 2.0 (what I have here in
print) suggests that problematic decompositions actually are restricted
to only a few scripts. LATIN (CAPITAL|SMALL) LETTER L WITH MIDDLE DOT
(used in Catalan, cf. sec. 5.1 of UAX#31) have compatibility
decompositions, unlike almost all other Latin characters (whose
decompositions are canonical, and thus get recomposed in NFKC). 'n
(Afrikaans) and a half-dozen Croatian digraphs corresponding to Serbian
Cyrillic would get lost. The Koreans would lose a truckload of partially
composed Hangul and some archaic ones, the Arabic speakers their
presentation forms. And that's about it (but I may have missed a bunch,
because that database doesn't give the character classes, so I guessed
for stuff like technical symbols -> not ID characters).

I suspect that as long as they have the precomposed Hangul,
partial-syllable "ligature" forms won't be an issue for Koreans. I can't
even distinguish the archaic versions from their compatibility
equivalents by eye, although I'm comfortable with pronouncing Hangul. I
have no opinion on the Latin decompositions mentioned above or the Arabic
presentation forms.

However, of the ones I can judge to some extent (Latin printer's
ligatures, width variants, non-syllabic precomposed Korean Jamo), *not
one* of the compatibility decompositions would be a loss in my opinion.
On the other hand, there are a bunch of cases where NFKC would be a
marked improvement.
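The cases listed above can be checked with a short sketch, again assuming Python's standard unicodedata module. The helper names nfkc and has_compat_decomposition are hypothetical, the sample characters are my own picks from each category, and the interpreter's bundled Unicode database is much newer than the printed Version 2.0 table, so this illustrates the categories rather than reproducing that scan.

```python
import unicodedata

def nfkc(text):
    """Normalize text to NFKC (hypothetical convenience wrapper)."""
    return unicodedata.normalize("NFKC", text)

def has_compat_decomposition(ch):
    """True if the character's own mapping is a compatibility one.

    unicodedata.decomposition() prefixes compatibility mappings with a
    tag such as "<compat>" or "<wide>"; canonical mappings carry no tag.
    """
    return unicodedata.decomposition(ch).startswith("<")

examples = [
    "\uff21",  # FULLWIDTH LATIN CAPITAL LETTER A
    "\uff76",  # HALFWIDTH KATAKANA LETTER KA
    "\u0140",  # LATIN SMALL LETTER L WITH MIDDLE DOT (Catalan)
    "\u0149",  # LATIN SMALL LETTER N PRECEDED BY APOSTROPHE (Afrikaans)
    "\u01c6",  # LATIN SMALL LETTER DZ WITH CARON (Croatian digraph)
    "\ufe8d",  # ARABIC LETTER ALEF ISOLATED FORM (presentation form)
    "\u00e9",  # LATIN SMALL LETTER E WITH ACUTE (canonical only)
]

for ch in examples:
    print(unicodedata.name(ch),
          has_compat_decomposition(ch),
          ascii(nfkc(ch)))
```

The last entry is the control case: its decomposition is canonical, so NFKC recomposes it and the character survives unchanged, whereas each of the others is replaced by its compatibility equivalent.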
_______________________________________________
Python-3000 mailing list
Python-3000@python.org
http://mail.python.org/mailman/listinfo/python-3000