On 6/6/07, Guido van Rossum <[EMAIL PROTECTED]> wrote:
> Why should the lexer apply normalization to literals behind my back?
The lexer shouldn't, but NFC-normalizing the source before the lexer sees it would be slightly more robust and standards-compliant. Technically, the Unicode standard allows an editor or any other program to apply any normalization, or other canonical-equivalent replacement, it sees fit, and other programs aren't supposed to care; the standard even says that such differences should be rendered indistinguishably. Practically everyone uses NFC, though.

> There's a simpler solution. The unicode (or str, in Py3k) data type
> represents a sequence of code points, not a sequence of characters.
> This has always been the case, and will continue to be the case.

This is how Java and ICU (http://www.icu-project.org/) do it, too. The latter is a library specifically designed for processing Unicode text. Both Java and ICU are even mentioned in the Unicode FAQ.

> Clearly we will have a normalization routine so the lexer can
> normalize identifiers, so if you need normalized data it is
> as simple as writing 'XXX'.normalize() (or whatever the spelling
> should be).

At the moment the routine lives at unicodedata.normalize().
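For example, a quick interactive sketch (only the existing unicodedata.normalize is assumed; the strings are just illustration) showing that two canonically equivalent spellings of "é" are distinct as code point sequences but compare equal after NFC normalization:

    >>> import unicodedata
    >>> u'\u00e9' == u'e\u0301'    # precomposed vs. 'e' + COMBINING ACUTE ACCENT
    False
    >>> unicodedata.normalize('NFC', u'e\u0301') == u'\u00e9'
    True

So code that cares about canonical equivalence can normalize explicitly where it needs to, rather than having the runtime do it behind its back.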