On 6/6/07, "Martin v. Löwis" <[EMAIL PROTECTED]> wrote:
> > I think "obvious" referred to the reasoning, not the outcome.
> > I can tell that the decision was "NFC, anything goes", but I don't see why.

> I think I'm repeating myself: Because UAX 31 says so. That's it. There
> is a standard that experts in the domain have specified, and PEP 3131
> follows it. Following standards is a good thing, deviating from them
> is a bad thing.

I think we are reading UAX31 very differently. If it is (or even seems)
ambiguous, then we need to specify our interpretation.

> > (2)
> > I cannot understand why ID_START/CONTINUE was chosen instead of the
> > newer and more recommended XID_START/CONTINUE. From UAX31 section 2:
> > """
> > The XID_Start and XID_Continue properties are improved lexical classes
> > that incorporate the changes described in Section 5.1, NFKC
> > Modifications. They are recommended for most purposes, especially for
> > security, over the original ID_Start and ID_Continue properties.
> > """

> Right. I read it that these should be used when 5.1 is considered
> in the language. This, in turn, should be used when the
> normalization form is NFKC:

I read that as XID is almost always better. XID is better for security
in particular, but also better for other things. And as an extra bonus,
XID even already takes care of some 5.1 issues for you.

And my personal opinion is that those 5.1 issues are not really
restricted to NFKC. Other normalization forms won't get syntactic
errors over them, but the results could still be nonsense.

Issue 1 is that Catalan treats a 0xB7 as a character instead of as
punctuation. The unicode recommendation (*required* only for NFKC, but
already supported by XID, since it is recommended) says "OK, it isn't
syntax or whitespace, and it is a character sometimes in practice, so
we'll allow it."

Issue 2 says "Technically these are characters, but they should never
be used to start a word, so don't start an identifier with them
anyhow." If you're not using NFKC, you *can* just ignore the problem
(and produce garbage), but you probably shouldn't.
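
A rough sketch of issues 1 and 2, for anyone who wants to poke at them.
It assumes a Python 3 interpreter, where str.isidentifier() checks the
first character against XID_Start and the rest against XID_Continue;
that is not something this thread had settled. The two sample
characters (U+00B7 and U+0E33) and the helper names are my own picks
for illustration, and exact results depend on the Unicode database the
interpreter ships with.

```python
import unicodedata

MIDDLE_DOT = "\u00b7"  # issue 1: punctuation (Po) used inside Catalan words
SARA_AM = "\u0e33"     # issue 2: THAI CHARACTER SARA AM, a "combining-like" letter


def nfkc(text):
    """Return text normalized to NFKC."""
    return unicodedata.normalize("NFKC", text)


def describe(ch):
    """Return (general category, NFKC form as a list of code points)."""
    return unicodedata.category(ch), ["U+%04X" % ord(c) for c in nfkc(ch)]


# The middle dot is stable under NFKC; SARA AM decomposes into a
# combining mark (U+0E4D) followed by a letter (U+0E32), which is why
# it is a poor choice for the first character of an identifier.
middle_dot_info = describe(MIDDLE_DOT)
sara_am_info = describe(SARA_AM)

# Issue 1: allowed in the middle of an identifier, not at the start.
catalan_ok = ("col" + MIDDLE_DOT + "lecci\u00f3").isidentifier()
leading_dot_ok = (MIDDLE_DOT + "x").isidentifier()

# Issue 2: allowed after a letter (here U+0E01), not at the start.
trailing_sara_am_ok = ("\u0e01" + SARA_AM).isidentifier()
leading_sara_am_ok = (SARA_AM + "x").isidentifier()

for name, value in [
    ("middle dot", middle_dot_info),
    ("sara am", sara_am_info),
    ("catalan word", catalan_ok),
    ("leading middle dot", leading_dot_ok),
    ("trailing sara am", trailing_sara_am_ok),
    ("leading sara am", leading_sara_am_ok),
]:
    print(name, value)
```

Under plain ID_Start, U+0E33 would be accepted as a first character
because its general category is Lo; the XID properties are what move it
to continue-only.
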
XID takes care of it for you. (At least for these characters.)

Issue 3 says "OK, these characters don't work with NFKC -- but you
shouldn't be using them anyhow." It even says explicitly that "It is
recommended that all Arabic presentation forms be excluded from
identifiers in any event".

Note that neither ID nor XID actually removes all the Arabic
presentation forms, despite this clear recommendation. Technically,
they are characters, and *could* be processed. XID removes the ones
that break NFKC, and xidmodifications removes some more (hopefully,
all the rest, but I haven't verified that).

> """
> Where programming languages are using NFKC to fold differences between
> characters, they need the following modifications of the identifier
> syntax from the Unicode Standard to deal with the idiosyncrasies of a
> small number of characters. These modifications are reflected in the
> XID_Start and XID_Continue properties.
> """
> As the PEP does not use NFKC (currently), it should not use XID_Start
> and XID_Continue either.

I read that as "If you are using NFKC, then you need to do some extra
work. But notice that if you are using the new and improved XID, then
some of this work was already done for you..."

> > Nor can I understand why the additional restrictions in
> > xidmodifications (from TR39) were ignored.

> Consideration of UTR 39 is listed as an open issue. One problem
> with it is that using it would restrict the language over time,
> so that previously correct programs might not be correct anymore
> in a future version. So using it might break backwards
> compatibility.

Then we should start with a more restricted charset, and expand it
over time.

The restrictions in xidmodifications are not remotely sufficient for
security, even now. (Doing that would require restricting some
characters that are actually needed in some languages.)
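
A small sketch of two of the points above, under the same assumption
that Python 3's str.isidentifier() follows XID_Start/XID_Continue. The
specific characters are my own illustrative choices; the thread names
none of them. The first pair assumes the security worry is about
look-alike letters: both are ordinary letters that some language needs,
and NFKC does not fold them together. The second pair is about issue 3:
one Arabic presentation form that normalizes cleanly and one that does
not.

```python
import unicodedata


def nfkc(text):
    """Return text normalized to NFKC."""
    return unicodedata.normalize("NFKC", text)


# Look-alike letters from two scripts (illustrative choice).
LATIN_A = "a"          # U+0061 LATIN SMALL LETTER A
CYRILLIC_A = "\u0430"  # U+0430 CYRILLIC SMALL LETTER A

both_valid = LATIN_A.isidentifier() and CYRILLIC_A.isidentifier()
folded_together = nfkc(LATIN_A) == nfkc(CYRILLIC_A)

# Two Arabic presentation forms (illustrative choice).
ALEF_ISOLATED = "\ufe8d"      # compatibility-decomposes to U+0627 ALEF
FATHATAN_ISOLATED = "\ufe70"  # compatibility-decomposes to SPACE + U+064B

alef_nfkc = nfkc(ALEF_ISOLATED)
fathatan_nfkc = nfkc(FATHATAN_ISOLATED)

# The form that normalizes to a plain letter is still accepted; the
# one that normalizes to something starting with a space is not.
alef_allowed = ALEF_ISOLATED.isidentifier()
fathatan_allowed = FATHATAN_ISOLATED.isidentifier()

print("confusable pair both valid:", both_valid)
print("confusable pair folded by NFKC:", folded_together)
print("alef isolated form allowed:", alef_allowed)
print("fathatan isolated form allowed:", fathatan_allowed)
```

Catching the first pair takes a confusables-style check rather than a
removal list, since neither letter can simply be dropped from the
allowed set.
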
Instead, xidmodifications represents (a mechanically determined subset
of) characters that can be removed cheaply, because they shouldn't be
used in identifiers anyhow.

-jJ
_______________________________________________
Python-3000 mailing list
Python-3000@python.org
http://mail.python.org/mailman/listinfo/python-3000