On 6/5/07, "Martin v. Löwis" <[EMAIL PROTECTED]> wrote:

> > 1. Python will lose the ability to make a reliable round trip to
> > a human-readable display on screen or on paper.

> Correct. Was already the case, though, because of comments
> and string literals.

But these are usually less important; when written as literals, they
are normally part of the User Interface, and if the user can't see the
difference, it doesn't matter. There are exceptions, such as the "HELO"
magic cookie in the (externally defined) SMTP protocol, but I think
these exceptions are uncommon -- and outside Python's control anyhow.

> > 5. Languages with non-ASCII identifiers use different
> > character sets and normalization schemes; PEP 3131's
> > choices are non-obvious.

> I disagree. PEP 3131 follows UAX#31 literally, and makes that
> decision very clear. If people still cannot see that,

I think "obvious" referred to the reasoning, not the outcome.

I can tell that the decision was "NFC, anything goes", but I don't
see why.

(1) I am not sure why it was NFC; UAX#31 seems agnostic on which
normalization form to use. The only explicit recommendations I can
find suggest using NFKC for identifiers.

    http://www.unicode.org/faq/normalization.html#2

(Outside of that recommendation for KC, it isn't even clear why we
should use the Composed form. As of tonight, I realized that
"composed" means less than I thought, and the actual algorithm means
it should work as well as the Decomposed forms -- but I had missed
that detail the first several times I read about the different
Normalization forms, and it certainly isn't included directly in the
PEP.)

(2) I cannot understand why ID_START/CONTINUE was chosen instead of
the newer and more recommended XID_START/CONTINUE. From UAX#31
section 2:

"""
The XID_Start and XID_Continue properties are improved lexical classes
that incorporate the changes described in Section 5.1, NFKC
Modifications. They are recommended for most purposes, especially for
security, over the original ID_Start and ID_Continue properties.
"""

Nor can I understand why the additional restrictions in
xidmodifications (from TR39) were ignored.
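
To make point (1) concrete, here is a small sketch using Python's
standard unicodedata module. It is only an illustration, not part of
PEP 3131: the sample strings and the helper name all_forms are made up
for this example. It shows that every normalization form maps two
canonically equivalent spellings to the same result (so Composed versus
Decomposed matters little for comparing identifiers), while the K forms
additionally fold compatibility characters such as the "fi" ligature.

```python
import unicodedata

FORMS = ("NFC", "NFD", "NFKC", "NFKD")


def all_forms(text):
    """Return a dict mapping each normalization form name to the normalized text."""
    return {form: unicodedata.normalize(form, text) for form in FORMS}


# Two canonically equivalent spellings of o-with-diaeresis.
precomposed = "\u00f6"   # LATIN SMALL LETTER O WITH DIAERESIS
combining = "o\u0308"    # "o" followed by COMBINING DIAERESIS

a, b = all_forms(precomposed), all_forms(combining)

# Each form maps both spellings to one result, so NFC is no worse than
# NFD at recognizing them as the same identifier.
same_under_every_form = all(a[form] == b[form] for form in FORMS)

# A compatibility character: U+FB01 LATIN SMALL LIGATURE FI.
ligature = all_forms("\ufb01le")
nfc_keeps_ligature = ligature["NFC"] == "\ufb01le"    # NFC leaves it alone
nfkc_folds_ligature = ligature["NFKC"] == "file"      # NFKC folds it to "fi"

print(same_under_every_form, nfc_keeps_ligature, nfkc_folds_ligature)
# prints: True True True
```

Under NFC, then, an identifier spelled with the ligature stays distinct
from its plain-ASCII lookalike; under NFKC the two would collapse into
one name.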
The reason for removing the characters restricted by xidmodifications
is given as

"""
The restricted characters are characters not in common use, removed so
as to further reduce the possibilities for visual confusion. Initially,
the following are being excluded: characters not in modern use;
characters only used in specialized fields, such as liturgical
characters, mathematical letter-like symbols, and certain phonetic
alphabetics; and ideographic characters that are not part of a set of
core CJK ideographs consisting of the CJK Unified Ideographs block plus
IICore (the set of characters defined by the IRG as the minimal set of
required ideographs for East Asian use). A small number of such
characters are allowed back in so that the profile includes all the
characters in the country-specific restricted IDN lists:
"""

As best I can tell, the remaining list is *still* too generous to be
called conservative, but the characters being removed are almost
certainly good choices for removal -- no one's native language requires
them.

> > B. Should the default behaviour accept only ASCII identifiers, or
> > should it accept identifiers containing non-ASCII characters?

> > D. Should the identifier character set be configurable?

> Still seems to be the same open issue.

Defaulting to ASCII or defaulting to "accept unicode" is one issue.

A related but separate issue is whether accepting unicode is a single
on/off switch, or whether it will be possible to accept only some
unicode characters. As written, there is no good way to accept, say,
Japanese characters, but not Cyrillic. I would prefer to whitelist
individual characters or scripts, but there should at least be a way to
exclude certain characters.

    http://www.unicode.org/reports/tr39/data/intentional.txt

is a list of characters that *should* be impossible to distinguish
visually.
It isn't just that the standard representations are identical (like
some of the combining marks looking like quote signs); it is that the
(distinct abstract) characters *should* use the same glyph, so long as
they are in the same (or even harmonized) fonts.

Several of the Greek and Cyrillic characters are glyph-identical with
ASCII letters. I won't say that people using those scripts shouldn't be
allowed to use those letters, but *I* certainly don't want to get code
using them just because I allowed the ö.

-jJ
_______________________________________________
Python-3000 mailing list
Python-3000@python.org
http://mail.python.org/mailman/listinfo/python-3000
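
The per-script whitelist asked for in the message (accept Japanese
characters and the ö, but not Cyrillic) could look roughly like the
sketch below. Everything in it is an assumption made for illustration:
the names script_guess, identifier_allowed and POLICY are hypothetical,
and the script label is a crude heuristic (the first word of the
Unicode character name) standing in for the real Unicode Script
property, which the standard unicodedata module does not expose.

```python
import unicodedata


def script_guess(ch):
    """Crude script label: 'ASCII', or the first word of the character's name."""
    if ord(ch) < 128:
        return "ASCII"
    name = unicodedata.name(ch, "")      # "" if the character has no name
    return name.split(" ")[0] if name else "UNKNOWN"


def identifier_allowed(identifier, allowed_scripts):
    """True if every character's guessed script is in the whitelist."""
    return all(script_guess(ch) in allowed_scripts for ch in identifier)


# Hypothetical project policy: ASCII, accented Latin, and Japanese.
POLICY = {"ASCII", "LATIN", "HIRAGANA", "KATAKANA", "CJK"}

latin_ok = identifier_allowed("sch\u00f6n", POLICY)         # contains the ö
japanese_ok = identifier_allowed("\u5909\u6570", POLICY)    # two CJK ideographs
# "payment" with U+0430 CYRILLIC SMALL LETTER A in place of the ASCII "a":
# visually the same word, but rejected by the whitelist.
cyrillic_ok = identifier_allowed("p\u0430yment", POLICY)

print(latin_ok, japanese_ok, cyrillic_ok)
# prints: True True False
```

A real implementation would more likely work from the Unicode script
data and the TR39 confusables lists than from character names, but the
shape of the check (a configurable set, tested per character) would be
much the same.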