(Martin's right, it's not good to discuss this in the huge PEP 3131 thread, so I'm changing the subject line)

On 6/6/07, Stephen J. Turnbull <[EMAIL PROTECTED]> wrote:
> In the language of these standards, I would expect that string
> comparison is exactly the kind of higher-level process they have in
> mind. In fact, it is given as an example in what Jim quoted above.
>
> > > C9 A process shall not assume that the interpretations of two
> > > canonical-equivalent character sequences are distinct.
> >
> > Right. What is "a process"?
>
> Anything that accepts Unicode on input or produces it on output, and
> claims to conform to the standard.

Strings are internal to Python. This is a whole separate issue from
normalization of source code or its parts (such as identifiers). Once
you have read in a text file and done the normalizations you want to,
what you have left is an internal representation in memory, which may
be anything that's convenient to the programmer. The question is, what
is convenient to the programmer?

> > > Ideally, an implementation would always interpret two
> > > canonical-equivalent character sequences identically. There are
> >
> > So it should be the application's choice.
>
> I don't think so. I think the kind of practical circumstance they
> have in mind is (eg) a Unicode document which is PGP-signed. PGP
> clearly will not be able to verify a canonicalized document, unless
> it happened to be in canonical form when transmitted. But I think it
> is quite clear that they do not admit that an implementation might
> return False when evaluating u"L\u00F6wis" == u"Lo\u0308wis".

It is up to Python to define what "==" means, just like it defines
what "is" means. It may be canonical equivalence for strings, but then
again it may not. It depends on what you want, and what you think
strings are. If you think they're sequences of code points, which is
what they act like in general (assuming UTF-32 was selected at compile
time), then bitwise comparison is quite consistent whether the string
is in normalized form or not.

Handling strings as sequences of code points is the most general and
simple thing to do, but there are other options. One is to simply
change comparison to be collation (and presumably also make regexp
matching and methods like startswith consistent with that). Another is
to always keep strings in a specific normalized form. Yet another is
to have another type for strings-as-grapheme-sequences, which would
strictly follow user expectations for characters (= graphemes), such
as string length and indexing, comparison, etc.

Changing just the comparison has the drawback that many current string
invariants break: a == b would no longer imply any of len(a) ==
len(b), set(a) == set(b), a[i:j] == b[i:j], repr(a) == repr(b). You'd
also have to use bytes for any processing of code point sequences
(such as XML processing), because most operations would act as if you
had normalized your strings (including dictionary and set operations),
and if you have to do contortions to avoid problems with that, then
it's easier to just use bytes. There would also be security
implications with strings comparing equal but not always quite acting
equal.
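To make that concrete, here is a quick interpreter sketch using the
string from Stephen's example and the stdlib unicodedata module (the
variable names are just for illustration):

    >>> import unicodedata
    >>> a = u"L\u00F6wis"    # precomposed o-with-diaeresis
    >>> b = u"Lo\u0308wis"   # 'o' followed by a combining diaeresis
    >>> a == b               # today: plain code point comparison
    False
    >>> unicodedata.normalize('NFC', b) == a   # yet canonically equivalent
    True
    >>> len(a), len(b)       # so an equivalence-based == couldn't preserve len()
    (5, 6)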
Always doing normalization would still force you to use bytes for
processing code point sequences (e.g. XML, which must not be
normalized), which is not nice. It's also not nice to force a
particular normalization on the programmer, as another one may be
better for some uses. E.g. an editor may be simpler to implement if
everything is consistently decomposed (NFD), but for most
communication you'd want to use NFC, as you would for much other
processing (e.g. the "code point == grapheme" equation is perfectly
adequate for many purposes with NFC, but not with NFD).

Having a type for grapheme sequences would seem like the least
problematic choice, but there's little demand for such a type. Most
intelligent Unicode processing doesn't use a grapheme representation
for performance reasons, and in most other cases the "code point ==
grapheme" equation or treating strings as atoms is adequate. The
standard library could provide this type if it proves necessary.

> > So this *allows* to canonicalize strings, it doesn't *require* Python
> > to do so. Indeed, doing so would be fairly expensive, and therefore
> > it should not be done (IMO).
>
> It would be much more expensive to make all string comparisons grok
> canonical equivalence. That's why it *allows* canonicalization.

FWIW, I don't buy that normalization is expensive, as most strings are
in NFC form anyway, and there are fast checks for that (see UAX #15,
"Detecting Normalization Forms"). Python does not currently have a
fast path for this, but if one were added, normalizing everything to
NFC should be fast.
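Just to sketch the idea (nothing official; the function name is made
up, and the ASCII test merely stands in for the real NFC_Quick_Check
property from UAX #15):

    import unicodedata

    def normalize_nfc(s):
        # Conservative fast path: pure-ASCII text is already in NFC (and
        # in every other normalization form), so the common case skips
        # the real work. A serious implementation would consult the
        # NFC_Quick_Check property from UAX #15 instead of this crude test.
        if all(ord(c) < 128 for c in s):
            return s
        return unicodedata.normalize('NFC', s)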
