Bill Janssen <[EMAIL PROTECTED]> wrote:
> > > Hear me out for a moment.  People type what they want.
>
> I do a lot of Pythonic processing of UTF-8, which is not "typed by
> people", but instead extracted from documents by automated processing.
> Text is also data -- an important thing to keep in mind.
Right, but (and this is a big but), you are reading data in from a file.
That is different from source code identifiers and embedded strings.  If
you *want* normalization to happen on your data, that is perfectly
reasonable, and you can do so (explicit is better than implicit).  But if
someone didn't want normalization, and Python did it anyway, then there
would be an error that passed silently.

> As far as normalization goes, I agree with you about identifiers, and
> I use "unicodedata.normalize" extensively in the cases where I care
> about normalization of data strings.  The big issue is string literals.
> I think I agree with Stephen here:
>
>     u"L\u00F6wis" == u"Lo\u0308wis"
>
> should be True (assuming he typed it correctly in the first place :-),
> because they are the same Unicode string.  I don't understand Guido's
> objection here -- it's a lexer issue, right?  The underlying character
> string will still be the same in both cases.

It's the Unicode character versus code point issue.  I personally prefer
code points, as a code-point approach does exactly what I want it to do
by default: nothing.  If it *does* something without my asking, then that
would seem to be magic to me, and I'm a minimal-magic kind of guy.

 - Josiah

_______________________________________________
Python-3000 mailing list
Python-3000@python.org
http://mail.python.org/mailman/listinfo/python-3000
Unsubscribe: http://mail.python.org/mailman/options/python-3000/archive%40mail-archive.com
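[Editorial note: the disagreement above can be shown in a few lines. This is a minimal sketch in Python 3 syntax (where all string literals are Unicode, so the `u` prefix is dropped); the variable names are illustrative only, and `unicodedata` is the standard-library module Bill mentions.]

```python
import unicodedata

precomposed = "L\u00F6wis"   # 'L' + LATIN SMALL LETTER O WITH DIAERESIS + 'wis'
decomposed = "Lo\u0308wis"   # 'L' + 'o' + COMBINING DIAERESIS + 'wis'

# Compared as sequences of code points -- Python's default, and the
# "do nothing by default" behavior Josiah prefers -- the two differ:
print(precomposed == decomposed)   # False (5 vs. 6 code points)

# Only after an *explicit* normalization step (NFC here) do the two
# spellings compare equal, which is the behavior Bill and Stephen
# would like the lexer to apply to literals automatically:
print(precomposed == unicodedata.normalize("NFC", decomposed))   # True
```

Whether that `normalize` call happens implicitly in the lexer or explicitly in user code is exactly the "errors should never pass silently" vs. "they are the same Unicode string" trade-off being argued.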