On 6/7/07, Bill Janssen <[EMAIL PROTECTED]> wrote:
> I meant to say that *strings* are explicitly sequences of characters,
> not codepoints.

This is false. When you access the contents of a string using the *sequence* protocol, what you get is code points, not characters (grapheme clusters). To get grapheme clusters, you have to use a regexp, as outlined in UAX#29. You could normalize at the same time, so that graphemes can be compared bitwise instead of by collation and still compare the way the user compares them. If you're going to do all that, you might as well implement your own type (which could even be provided by the standard library).

Note that normalization alone does not produce a sequence of grapheme clusters, because there aren't precomposed characters for everything; for full generality you just have to deal with combining characters.

> I also believe that the literal form '\u0308' should generate a compile
> error. It's a valid Unicode codepoint, sure, but not a valid string.

Then you wouldn't even be able to iterate over or index strings anymore, since that could produce such "invalid" strings, which would have to raise exceptions if you really want to ban them. Or is there any point in making people type 'o\u0308'[1] instead of '\u0308'?
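
To make the points about indexing and normalization concrete, here is a minimal sketch, assuming Python 3 `str` semantics and the standard `unicodedata` module. The helper `rough_clusters` is a hypothetical name introduced only for illustration. It merely attaches combining marks to the preceding code point, which is a simplification and not the full UAX#29 segmentation algorithm.

```python
import unicodedata


def rough_clusters(text):
    """Group each code point with the combining marks that follow it.

    Hypothetical helper: a rough approximation of grapheme clusters,
    not an implementation of the UAX#29 rules.
    """
    clusters = []
    for ch in text:
        # unicodedata.combining() returns a nonzero class for combining marks.
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


decomposed = "o\u0308"  # 'o' followed by COMBINING DIAERESIS

# The sequence protocol hands back code points, so indexing can yield
# a lone combining mark.
print(len(decomposed))
print(decomposed[1] == "\u0308")

# NFC composes this pair because a precomposed form (U+00F6) exists.
print(len(unicodedata.normalize("NFC", decomposed)))

# There is no precomposed "q with diaeresis", so NFC leaves two code points.
print(len(unicodedata.normalize("NFC", "q\u0308")))

# Grouping by combining class keeps each base together with its mark.
print(len(rough_clusters("q\u0308o\u0308")))
```

The last two calls show why normalization alone is not enough: the normalized string can still contain combining characters, so any cluster-level view has to be built on top of the code point sequence.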