On 6/5/07, Stephen J. Turnbull <[EMAIL PROTECTED]> wrote:

> It seems to me that what UAX#31 is saying is "Distinguishing (or not)
> between 0035 DIGIT 3 and 2075 SUPERSCRIPT 3 should be
> equivalent to distinguishing (or not) between LATIN CAPITAL
> LETTER A and LATIN SMALL LETTER A." I don't know that
> I agree (or disagree) in principle.

So effectively, they consider "a" and "A" to be presentational variants.

In some languages, certain presentational variants are used depending on word position. I think the ID_START property does exclude letters that cannot appear in an initial position, but putting a final character in the middle or vice versa would still be wrong.

If identifiers are only ever typed, I suppose that isn't a problem. If identifiers are built up in the equivalent of handler="do_" + name then the character will sometimes be wrong in a way that many editors will either hide or silently "correct."

The standard also says (but I can't verify) that replacing the presentational variant with the generic form will generally *improve* presentation, presumably because there are now more systems which do the font shaping correctly than there are systems able to handle the old character formats.

The folding rules do say that it is OK (even good) to exclude certain characters from certain foldings; I think we could preserve case (including title-case?) as the only presentational variant we recognize.

> A scan of the full table for Unicode Version 2.0 (what I have here in
> print) suggests that problematic decompositions actually are
> restricted to only a few scripts. LATIN (CAPITAL|SMALL)
> LETTER L WITH MIDDLE DOT (used in Catalan, cf sec. 5.1 of
> UAX#31)

As best I understand it, this one would be helped by using compatibility mappings. There is an official way to spell l-middle dot, but enough old texts used the "wrong" character that it has to be special-cased for round-tripping. Since the ID is a final destination, we care less about round-trips, and more about "if they switch editors, will the identifier still match".

At the very least, it is mentioned as needing special care (when used as an identifier) in http://www.unicode.org/reports/tr31/ section 5.1 paragraph 1.

> decompositions, unlike almost all other Latin decompositions (which
> are canonical, and thus get recomposed in NFKC).
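
To make the canonical-versus-compatibility distinction concrete, here is a rough sketch using only Python's standard unicodedata module. The helper names normalize_identifier and compat_tag are made up for illustration and are not part of any proposal; the sketch only shows what NFKC folding does to a few of the characters under discussion, and that case survives it.

```python
import unicodedata


def normalize_identifier(name):
    """Hypothetical helper: fold compatibility variants with NFKC.

    Case is left untouched, so "A" and "a" stay distinct.
    """
    return unicodedata.normalize("NFKC", name)


def compat_tag(ch):
    """Hypothetical helper: return the compatibility tag of a character
    (for example '<super>'), or None if its decomposition is canonical
    or it has no decomposition at all."""
    decomp = unicodedata.decomposition(ch)
    if decomp.startswith("<"):
        return decomp.split()[0]
    return None


samples = [
    "\u0140",  # LATIN SMALL LETTER L WITH MIDDLE DOT
    "\u00b3",  # SUPERSCRIPT THREE
    "\ufb01",  # LATIN SMALL LIGATURE FI
    "\ufe91",  # ARABIC LETTER BEH INITIAL FORM
    "\u00e9",  # LATIN SMALL LETTER E WITH ACUTE (canonical decomposition)
    "A",       # plain capital letter, no decomposition
]

for ch in samples:
    folded = normalize_identifier(ch)
    code_points = " ".join("U+%04X" % ord(c) for c in folded)
    print("U+%04X tag=%s NFKC=%s" % (ord(ch), compat_tag(ch), code_points))

# An identifier assembled from pieces, as in handler="do_" + name:
# normalizing the final string makes the ligature spelling match the plain one.
name = "\ufb01nish"
handler = normalize_identifier("do_" + name)
print(handler == "do_finish")
```

The point of normalizing the assembled string rather than each piece is that the result no longer depends on which editor or input method produced the parts.
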
> 'n (Afrikaans), and
> a half-dozen Croatian digraphs corresponding to Serbian Cyrillic would
> get lost. The Koreans would lose a truckload of partially composed
> Hangul and some archaic ones,

http://www.unicode.org/versions/corrigendum3.html suggests that many of the Hangul are either pronunciation guide variants or even exact duplicates (that were presumably missed when the canonicalization was frozen?)

> the Arabic speakers their presentation forms.

http://www.unicode.org/reports/tr31/ 5.1 paragraph 3 includes:

"""It is recommended that all Arabic presentation forms be excluded from identifiers in any event, although only a few of them must be excluded for normalization to guarantee identifier closure."""

> And that's about it (but I may have missed a bunch because
> that database doesn't give the character classes, so I guessed for
> stuff like technical symbols -> not ID characters).

Depends on what you mean by technical symbols. IMHO, many of them are in fact listed as ID characters. The math versions (generally 1D400 - 1DC7B) are included. But http://unicode.org/reports/tr39/data/xidmodifications.txt suggests excluding them again.

> However, of the ones I can judge to some extent (Latin printer's
> ligatures, width variants, non-syllabic precomposed Korean Jamo), *not
> one* of the compatibility decompositions would be a loss in my
> opinion. On the other hand, there are a bunch of cases where NKFC
> would be a marked improvement.

-jJ
_______________________________________________
Python-3000 mailing list
Python-3000@python.org
http://mail.python.org/mailman/listinfo/python-3000