Guido writes:
> > As far as I can tell, CPython on windows uses UTF-16 with code units.
> > Perhaps not intentionally, but by default (not throwing an error on
> > surrogates).
>
> This is intentional, to be compatible with the rest of that platform.
> Jython and IronPython do this too I believe.

The following code illustrates this:

    >>> msg = u'The ancient greeks used the letter "\U00010143" for the number 5.'
    >>> msg[35:-18]
    u'"\U00010143"'
    >>> greek_five = msg[36:-19]
    >>> len(greek_five)
    2
    >>> greek_five[0]
    u'\ud800'
    >>> greek_five[1]
    u'\udd43'

The single Unicode character greek_five, when expressed as a string in
CPython, has a length of 2 and can be sliced into two separate
characters. In Jython, the code above will not work because Jython
doesn't currently support \U or extended Unicode (but someday that may
change). I'm not sure about IronPython.

So if I understand Guido's point, he's saying that it is on purpose
that len(greek_five) == 2. That's useful for compatibility today with
the Java and Microsoft VM platforms. But it's not particularly
compatible with extended Unicode. (Technically it doesn't violate any
rules so long as it's clearly defined that a character in Python is
NOT the same as a Unicode code point.)

I wonder if it would be better to say that len(greek_five) is
undefined in Python. (And obviously slicing behavior follows from len
behavior.) There are excellent reasons for CPython to return 2 in the
near future, but the far future is less clear. And Jython and
IronPython will be constrained by common sense to do whatever their
underlying platform does, even if that changes in the future.

Designing these things would be a lot easier if we had a time machine
so we could go see how extended Unicode is used in practice a decade
or two from now. Oh, wait....

-- Michael Chermside

_______________________________________________
Python-3000 mailing list
http://mail.python.org/mailman/listinfo/python-3000
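
For comparison, here is a minimal sketch of the same character on an interpreter that indexes strings by code point. It assumes Python 3.3 or later, which is not the interpreter discussed in the message above. The helper utf16_code_units is a hypothetical name introduced only for this illustration; it recovers the UTF-16 code units by encoding the string, which is one way to see the surrogate pair that a UTF-16 based implementation exposes directly through indexing.

```python
# Assumption: Python 3.3+, where len() on a str counts code points.
greek_five = "\U00010143"


def utf16_code_units(text):
    """Return the UTF-16 code units of text as a list of ints.

    Hypothetical helper for illustration, not part of any library.
    """
    data = text.encode("utf-16-le")  # little-endian, no byte order mark
    # Each UTF-16 code unit occupies exactly two bytes.
    return [int.from_bytes(data[i:i + 2], "little")
            for i in range(0, len(data), 2)]


units = utf16_code_units(greek_five)
print(len(greek_five))          # 1: one code point
print(len(units))               # 2: two UTF-16 code units
print([hex(u) for u in units])  # ['0xd800', '0xdd43']
```

The two values printed on the last line match greek_five[0] and greek_five[1] in the session above, so the length of 2 reported there is the count of UTF-16 code units rather than the count of code points.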
