On Wed, 26 Aug 2009 16:27:33 -0700, rurpy wrote:

> But regardless, the significant question is, what is the reason for
> having ord() (and unichr) not work for surrogate pairs and thus not
> usable with a large number of unicode characters that Python otherwise
> supports?

I'm no expert on Unicode, but my guess is that the reason is a desire for simplicity: unichr() should always return a single char, not a pair of chars, and similarly ord() should take a single char as input, not two, and return a single number.

Otherwise it would be ambiguous whether ord(surrogate_pair) should return a pair of ints representing the codes for each item in the pair, or a single int representing the code point for the whole pair. E.g. given your earlier example:

>>> a = u'\U00010040'
>>> len(a)
2
>>> a[0]
u'\ud800'
>>> a[1]
u'\udc40'

would you expect ord(a) to return (0xd800, 0xdc40) or 0x10040? If the latter, what about ord(u'ab')?

Remember that a unicode string can contain code points that aren't valid characters:

>>> ord(u'\ud800')  # reserved for surrogates, not a character
55296

so if ord() sees a surrogate pair, it can't assume it's meant to be treated as a surrogate pair rather than as two code points that just happen to match a surrogate pair.

None of this means you can't deal with surrogate pairs, it just means you can't deal with them using ord() and unichr().

The above is just my guess, I'd be interested to hear what others say.

--
Steven
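
P.S. For what it's worth, here is a rough sketch of how you might handle surrogate pairs yourself with plain integer arithmetic, following the standard UTF-16 rule. The function names combine_surrogates and split_code_point are made up for illustration; they are not part of Python or any library, and I have only checked them against the example above.

```python
def combine_surrogates(high, low):
    """Combine a UTF-16 high/low surrogate pair (as ints) into one code point.

    Hypothetical helper for illustration, not a Python builtin.
    """
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError("not a high surrogate: %#x" % high)
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError("not a low surrogate: %#x" % low)
    # Each surrogate carries 10 bits of the offset above U+10000.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


def split_code_point(code_point):
    """Split a code point above U+FFFF into its (high, low) surrogate ints.

    Hypothetical helper for illustration, not a Python builtin.
    """
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point outside the supplementary range")
    offset = code_point - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


# The pair from the example above: u'\ud800', u'\udc40'
print(hex(combine_surrogates(0xD800, 0xDC40)))
print([hex(n) for n in split_code_point(0x10040)])
```

With the string from the example you would call it as combine_surrogates(ord(a[0]), ord(a[1])), which sidesteps the ambiguity because you, not ord(), decide that the two items are meant to be a pair.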