> Hear me out for a moment.  People type what they want.

I do a lot of Pythonic processing of UTF-8, which is not "typed by
people", but instead extracted from documents by automated processing.
Text is also data -- an important thing to keep in mind.

As far as normalization goes, I agree with you about identifiers, and
I use "unicodedata.normalize" extensively in the cases where I care
about normalization of data strings.  The big issue is string literals.
I think I agree with Stephen here:

    u"L\u00F6wis" == u"Lo\u0308wis"

should be True (assuming he typed it correctly in the first place :-),
because they are the same Unicode string.  I don't understand Guido's
objection here -- it's a lexer issue, right?  The underlying character
string will still be the same in both cases.
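To make that concrete, here's a minimal sketch of what I mean (today
the raw comparison is False, and normalizing both sides to NFC first,
as a normalizing lexer effectively would, makes it True):

    import unicodedata

    a = u"L\u00F6wis"   # precomposed LATIN SMALL LETTER O WITH DIAERESIS
    b = u"Lo\u0308wis"  # 'o' followed by COMBINING DIAERESIS (U+0308)

    print a == b   # False under today's code-point-by-code-point comparison
    print unicodedata.normalize("NFC", a) == \
          unicodedata.normalize("NFC", b)   # True -- same Unicode string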

But it's complicated.  Clearly we expect

    (u"abc" + u"def") == (u"a" + u"bcdef")

to be True, so

    (u"L" + u"\u00F6" + u"wis") == (u"Lo" + u"\u0308" + u"wis")

should also be True.  Where I see difficulty is

    (u"L" + unchr(0x00F6) + u"wis") == (u"Lo" + unichr(0x0308) + u"wis")

I suppose unichr(0x0308) should raise an exception -- a combining
diacritic by itself shouldn't be convertible to a character.
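Something like this hypothetical wrapper is what I have in mind --
checked_unichr is just my name for it, not a proposal for the actual
spelling, and it leans on unicodedata.category to spot combining marks:

    import unicodedata

    def checked_unichr(code):
        # Hypothetical guard: refuse to make a "character" out of a
        # bare combining mark (general category Mn, Mc, or Me).
        c = unichr(code)
        if unicodedata.category(c).startswith("M"):
            raise ValueError("U+%04X is a combining mark, not a character"
                             % code)
        return c

    print repr(checked_unichr(0x00F6))   # u'\xf6' -- a real character
    checked_unichr(0x0308)               # raises ValueError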

Bill

