> Hear me out for a moment. People type what they want.

I do a lot of Pythonic processing of UTF-8, which is not "typed by people", but instead extracted from documents by automated processing. Text is also data -- an important thing to keep in mind.
As far as normalization goes, I agree with you about identifiers, and I use "unicodedata.normalize" extensively in the cases where I care about normalization of data strings.

The big issue is string literals. I think I agree with Stephen here: u"L\u00F6wis" == u"Lo\u0308wis" should be True (assuming he typed it correctly in the first place :-), because they are the same Unicode string. I don't understand Guido's objection here -- it's a lexer issue, right? The underlying character string will still be the same in both cases.

But it's complicated. Clearly we expect

    (u"abc" + u"def") == (u"a" + u"bcdef")

to be True, so

    (u"L" + u"\u00F6" + u"wis") == (u"Lo" + u"\u0308" + u"wis")

should also be True. Where I see difficulty is

    (u"L" + unichr(0x00F6) + u"wis") == (u"Lo" + unichr(0x0308) + u"wis")

I suppose unichr(0x0308) should raise an exception -- a combining diacritic by itself shouldn't be convertible to a character.

Bill

_______________________________________________
Python-3000 mailing list
Python-3000@python.org
http://mail.python.org/mailman/listinfo/python-3000
Unsubscribe: http://mail.python.org/mailman/options/python-3000/archive%40mail-archive.com
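For what it's worth, here is how the comparison plays out today with explicit unicodedata.normalize calls -- a minimal sketch in Python 3 spelling, where plain string literals are Unicode and the u"" prefix above is unnecessary:

```python
import unicodedata

s1 = "L\u00F6wis"   # precomposed LATIN SMALL LETTER O WITH DIAERESIS
s2 = "Lo\u0308wis"  # plain 'o' followed by COMBINING DIAERESIS

# Comparison today is code-point-by-code-point, so these are unequal
# even though they are canonically equivalent:
print(s1 == s2)  # → False

# Normalizing both sides to the same form (NFC here) makes them equal:
nfc1 = unicodedata.normalize("NFC", s1)
nfc2 = unicodedata.normalize("NFC", s2)
print(nfc1 == nfc2)  # → True
```

The proposal in this thread amounts to having the lexer (or str itself) apply such a normalization so the explicit calls become unnecessary for literals.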