On 6/7/07, Stephen J. Turnbull <[EMAIL PROTECTED]> wrote:
> What bothers me about the "sequence of code points" way of thinking is
> that len("Löwis") is nondeterministic.

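The quoted concern can be sketched as follows. This is a minimal illustration, assuming Python 3 string semantics and the standard `unicodedata` module; the variable names are hypothetical. The two strings render identically, but their lengths differ depending on whether 'ö' is stored as one precomposed code point or as 'o' followed by a combining diaeresis.

```python
import unicodedata

# Two spellings of the same visible name. Escapes are used so that any
# normalization applied to this source file cannot change the result.
composed = "L\u00f6wis"      # 'ö' as one precomposed code point (NFC form)
decomposed = "Lo\u0308wis"   # 'o' followed by U+0308 COMBINING DIAERESIS (NFD form)

print(len(composed))    # 5
print(len(decomposed))  # 6

# str comparison works on code points, with no implicit normalization.
print(composed == decomposed)  # False

# Normalizing explicitly makes the two spellings agree.
nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", composed)
print(nfc == composed, nfd == decomposed)  # True True
```

Whether the length is 5 or 6 is therefore a property of the particular code point sequence, and a program that cares can pick a form explicitly with `unicodedata.normalize`.
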
It doesn't have to be, *for this specific example*. After what I've read so far, I'm okay with normalization happening on the text of the source code before it reaches the lexer, if that's what people prefer. I'm also okay with normalization happening by default in the text I/O layer, as long as there's a way to disable it that doesn't require me to switch to bytes.

However, I'm *not* okay with requiring all text strings to be normalized, or normalizing them before comparing/hashing, after slicing/concatenation, etc. If you want to have an abstraction that guarantees you'll never see an unnormalized text string you should design a library for doing so. I encourage you or others to contribute such a library (*). But the 3.0 core language's 'str' type (like Python 2.x's 'unicode' type) will be an array of code points that is neutral about normalization.

Python is a general programming language, not a text manipulating library. As a general programming language, it must be possible to represent unnormalized sequences of code points -- otherwise, it could not implement algorithms for normalization in Python! (Again, forcing me to do this using UTF-8-encoded bytes or lists of ints is unacceptable.)

There are also Jython and IronPython to consider. These have extensive integration in the Java and .NET runtimes, respectively, where strings are represented as sequences of code points. Having a correspondence between the "natural" string type across language boundaries is very important.

Yes, this makes text processing harder if you want to get every corner case right. We need to educate our users about Unicode and point them to relevant portions of the standard. I don't think that can be avoided anyway -- the complexity is inherent to the domain of multi-alphabet text processing, and cannot be argued away by insisting that the language handle it.

(*) It looks like such a library will not have a way to talk about "\u0308" at all, since it is considered unnormalized.
Things like bidirectionality will probably have to be handled in a different way (without referencing the code points indicating text direction) as well.

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)
_______________________________________________
Python-3000 mailing list
Python-3000@python.org
http://mail.python.org/mailman/listinfo/python-3000
Unsubscribe: http://mail.python.org/mailman/options/python-3000/archive%40mail-archive.com