On 6/6/07, Jim Jewett <[EMAIL PROTECTED]> wrote:
> On 6/6/07, Guido van Rossum <[EMAIL PROTECTED]> wrote:
>
> > > > about normalization of data strings. The big issue is string literals.
>
> > > I think I agree with Stephen here:
>
> > > >    u"L\u00F6wis" == u"Lo\u0308wis"
>
> > > should be True (assuming he typed it correctly in the first place :-),
> > > because they are the same Unicode string.
>
> > So let me explain it. I see two different sequences of code points:
> > 'L', '\u00F6', 'w', 'i', 's' on the one hand, and 'L', 'o', '\u0308',
> > 'w', 'i', 's' on the other. Never mind that Unicode has semantics that
> > claim they are equivalent.
>
> Your (conforming) editor can silently replace one with the other.
No it cannot. We are talking about \u escapes, not about a string
literal containing Unicode characters ("Löwis").

> A second editor can silently use one, and not replace the other.
> ==> Uncontrollable, invisible bugs.

No. Seems you're again not reading before posting. :-(

> > They are two different sequences of code points.
>
> So "str" is about bytes, rather than text?
> and bytes is also about bytes; it just happens to be mutable?

Bytes are not code points. The unicode string type has always been
about code points, not characters.

> Then what was the point of switching to unicode? Why not just say
> "When printed, a string will be interpreted as if it were UTF-8" and
> be done with it?

Manipulating code points is a lot more convenient than manipulating UTF-8.

> > We should not hide that Python's unicode string object can
> > store each sequence of code points equally well, and that when viewed
> > as a sequence they are different: the first has len() == 5, the second
> > has len() == 6!
>
> For a bytes object, that is true. For unicode text, they shouldn't be
> different -- at least not by the time a user can see it (or measure
> it).

Have you ever even used the unicode string type in Python 2?

> > I might be writing either literal with the expectation to get exactly that
> > sequence of code points,
>
> Then you are assuming non-conformance with unicode, which requires you
> not to depend on that distinction. You should have used bytes, rather
> than text.

Again, bytes != code points.

> http://www.unicode.org/versions/Unicode4.0.0/ch03.pdf (Conformance)
>
> C9 A process shall not assume that the interpretations of two
> canonical-equivalent character sequences are distinct.

That is surely contained inside all sorts of weasel words that allow
us to define a "normalized equivalence" function that works that way,
and leave the "==" operator for arrays of code points alone.

> > Note that I'm not arguing against normalization of *identifiers*. I
> > see that as a necessity. I also see that there will be border cases
> > where getattr(x, 'XXX') and x.XXX are not equivalent for some values
> > of XXX where the normalized form is a different sequence of code
> > points. But I don't believe the solution should be to normalize all
> > string literals.
>
> For strings created by an extension module, that would be valid. But
> python source code is human-readable text, and should be treated that
> way. Either follow the unicode rules (at least for strings), or don't
> call them unicode.

Again, did you realize that the example was about \u escapes?

--
--Guido van Rossum (home page: http://www.python.org/~guido/)
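
A minimal sketch of the distinction discussed above, assuming present-day
Python (3.3+, where the u prefix is accepted again) and the standard
unicodedata module: the two literals are different sequences of code points,
so len() and == see them as different, while canonical equivalence can be
checked separately by normalizing both sides first.

    import unicodedata

    # Composed form: 'L', U+00F6 (LATIN SMALL LETTER O WITH DIAERESIS), 'w', 'i', 's'
    composed = u"L\u00F6wis"
    # Decomposed form: 'L', 'o', U+0308 (COMBINING DIAERESIS), 'w', 'i', 's'
    decomposed = u"Lo\u0308wis"

    # As raw sequences of code points they differ in length and compare unequal.
    print(len(composed), len(decomposed))   # 5 6
    print(composed == decomposed)           # False

    # An explicit "normalized equivalence" check: normalize both sides
    # (NFC here; NFD works equally well) and compare the results.
    print(unicodedata.normalize("NFC", composed) ==
          unicodedata.normalize("NFC", decomposed))   # True

This is the separation argued for above: == stays a plain comparison of code
point arrays, and normalization is an explicit, opt-in operation.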