On 6/6/07, Bill Janssen <[EMAIL PROTECTED]> wrote:
> > Hear me out for a moment. People type what they want.
>
> I do a lot of Pythonic processing of UTF-8, which is not "typed by
> people", but instead extracted from documents by automated processing.
> Text is also data -- an important thing to keep in mind.
>
> As far as normalization goes, I agree with you about identifiers, and
> I use "unicodedata.normalize" extensively in the cases where I care
> about normalization of data strings. The big issue is string literals.
> I think I agree with Stephen here:
>
>     u"L\u00F6wis" == u"Lo\u0308wis"
>
> should be True (assuming he typed it correctly in the first place :-),
> because they are the same Unicode string. I don't understand Guido's
> objection here -- it's a lexer issue, right? The underlying character
> string will still be the same in both cases.
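For concreteness, the explicit normalization Bill describes uses the
standard unicodedata module; a minimal sketch (written in Python 3
spelling, where string literals need no u prefix):

    import unicodedata

    # Two representations of the same text, as they might arrive from
    # different documents: precomposed vs. decomposed "Löwis".
    composed = "L\u00F6wis"     # U+00F6 LATIN SMALL LETTER O WITH DIAERESIS
    decomposed = "Lo\u0308wis"  # 'o' followed by U+0308 COMBINING DIAERESIS

    composed == decomposed   # False: the code point sequences differ
    # After canonical composition both sides are the same sequence:
    unicodedata.normalize("NFC", composed) == \
        unicodedata.normalize("NFC", decomposed)   # True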
So let me explain it. I see two different sequences of code points: 'L',
'\u00F6', 'w', 'i', 's' on the one hand, and 'L', 'o', '\u0308', 'w',
'i', 's' on the other. Never mind that Unicode has semantics that claim
they are equivalent. They are two different sequences of code points. We
should not hide that Python's unicode string object can store each
sequence of code points equally well, and that when viewed as a sequence
they are different: the first has len() == 5, the second has len() == 6!
When read from a file they are different. Why should the lexer apply
normalization to literals behind my back? I might be writing either
literal with the expectation to get exactly that sequence of code
points, in order to use it as a test case or as input for another
program that requires specific input.

> But it's complicated. Clearly we expect
>
>     (u"abc" + u"def") == (u"a" + u"bcdef")
>
> to be True, so
>
>     (u"L" + u"\u00F6" + u"wis") == (u"Lo" + u"\u0308" + u"wis")
>
> should also be True. Where I see difficulty is
>
>     (u"L" + unichr(0x00F6) + u"wis") == (u"Lo" + unichr(0x0308) + u"wis")
>
> I suppose unichr(0x0308) should raise an exception -- a combining
> diacritic by itself shouldn't be convertible to a character.

There's a simpler solution. The unicode (or str, in Py3k) data type
represents a sequence of code points, not a sequence of characters. This
has always been the case, and will continue to be the case.

Note that I'm not arguing against normalization of *identifiers*. I see
that as a necessity. I also see that there will be border cases where
getattr(x, 'XXX') and x.XXX are not equivalent for some values of XXX
where the normalized form is a different sequence of code points. But I
don't believe the solution should be to normalize all string literals.
Clearly we will have a normalization routine so the lexer can normalize
identifiers, so if you need normalized data it is as simple as writing
'XXX'.normalize() (or whatever the spelling should be).

--
--Guido van Rossum (home page: http://www.python.org/~guido/)
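To make the distinction concrete: the two literals really are different
code point sequences, and normalization happens only when the program
asks for it. (The 'XXX'.normalize() spelling above is hypothetical; the
routine that actually exists is unicodedata.normalize.) A minimal
sketch, again in Python 3 spelling:

    import unicodedata

    s1 = "L\u00F6wis"   # 5 code points: L, ö, w, i, s
    s2 = "Lo\u0308wis"  # 6 code points: L, o, combining diaeresis, w, i, s

    len(s1), len(s2)    # (5, 6) -- the lexer does not normalize literals
    s1 == s2            # False: comparison is code point by code point

    # A lone combining mark is a valid one-code-point string, because
    # str is a sequence of code points, not of "characters":
    chr(0x0308)         # '\u0308'

    # Normalization is explicit and opt-in:
    unicodedata.normalize("NFC", s1) == unicodedata.normalize("NFC", s2)  # True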
