On 6/6/07, Guido van Rossum <[EMAIL PROTECTED]> wrote:
> On 6/6/07, Jim Jewett <[EMAIL PROTECTED]> wrote:
> > On 6/6/07, Guido van Rossum <[EMAIL PROTECTED]> wrote:
> >
> > > > about normalization of data strings. The big issue is string literals.
> > > > I think I agree with Stephen here:
> > > > u"L\u00F6wis" == u"Lo\u0308wis"
> > > > should be True (assuming he typed it correctly in the first place :-),
> > > > because they are the same Unicode string.
> > > So let me explain it. I see two different sequences of code points:
> > > 'L', '\u00F6', 'w', 'i', 's' on the one hand, and 'L', 'o', '\u0308',
> > > 'w', 'i', 's' on the other. Never mind that Unicode has semantics that
> > > claim they are equivalent.
> > Your (conforming) editor can silently replace one with the other.
> No it cannot. We are talking about \u escapes, not about a string
> literal containing Unicode characters ("Löwis").
Ahh... my apologies. I was interpreting the \u as a way of showing
the bytes in email. I discarded the interpretation you are using
because that would require a sequence of 10 or 11 code points, rather
than the 5 or 6 you mentioned.
Python lexes it into a shorter string (just as it lexes 1.0 into a
number) at a conceptually later time. Those later strings should
compare equal according to Unicode, but I agree that you no longer
need to worry about editors introducing bugs. (And I even agree that
this may be a valid case for ignoring the recommendation; if someone has
been explicit by writing out 6 characters to represent one, they
probably meant it.)
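
To make the counts above concrete, here is a minimal sketch. It is not
code from this thread: it assumes present-day Python 3 syntax (no u
prefix) and the standard unicodedata module, and the variable names are
made up for illustration. It shows the source text as typed (10 or 11
characters), the lexed strings (5 or 6 code points), and that the lexed
strings compare unequal code point by code point but equal once both
are normalized to NFC.

```python
import unicodedata

# Source text as typed, escapes left unprocessed (raw literals).
raw_composed = r"L\u00F6wis"      # 10 characters in the source
raw_decomposed = r"Lo\u0308wis"   # 11 characters in the source

# What the lexer produces from the same literals.
composed = "L\u00F6wis"           # 5 code points: precomposed o-umlaut
decomposed = "Lo\u0308wis"        # 6 code points: 'o' + combining diaeresis

print(len(raw_composed), len(raw_decomposed))   # 10 11
print(len(composed), len(decomposed))           # 5 6

# Plain comparison is by code point, so the two strings differ.
plain_equal = composed == decomposed
print(plain_equal)                              # False

# After normalizing both sides to NFC they are the same sequence.
nfc_equal = (unicodedata.normalize("NFC", composed)
             == unicodedata.normalize("NFC", decomposed))
print(nfc_equal)                                # True
```

Whether == should do that normalization implicitly is exactly the
question being argued; the sketch only shows what each choice looks
like.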
-jJ
_______________________________________________
Python-3000 mailing list