On Aug 25, 9:53 pm, "Mark Tolonen" <metolone+gm...@gmail.com> wrote:
> <ru...@yahoo.com> wrote in message
> news:2ad21a79-4a6c-42a7-8923-beb304bb5...@v20g2000yqm.googlegroups.com...
>
> > In Python 2.5 on Windows I could do [*1]:
> >
> >   # Create a unicode character outside of the BMP.
> >   >>> a = u'\U00010040'
> >
> >   # On Windows it is represented as a surrogate pair.
> >   >>> len(a)
> >   2
> >   >>> a[0],a[1]
> >   (u'\ud800', u'\udc40')
> >
> >   # Create the same character with the unichr() function.
> >   >>> a = unichr (65600)
> >   >>> a[0],a[1]
> >   (u'\ud800', u'\udc40')
> >
> >   # Although the unichr() function works fine, its
> >   # inverse, ord(), doesn't.
> >   >>> ord (a)
> >   TypeError: ord() expected a character, but string of length 2 found
> >
> > On Python 2.6, unichr() was "fixed" (using the word
> > loosely) so that it too now fails with characters outside
> > the BMP.
> >
> >   >>> a = unichr (65600)
> >   ValueError: unichr() arg not in range(0x10000) (narrow Python build)
> >
> > Why was this done rather than changing ord() to accept a
> > surrogate pair?
> >
> > Does not this effectively make unichr() and ord() useless
> > on Windows for all but a subset of unicode characters?
>
> Switch to Python 3?
>
> >>> x='\U00010040'
> >>> import unicodedata
> >>> unicodedata.name(x)
> 'LINEAR B SYLLABLE B025 A2'
> >>> ord(x)
> 65600
> >>> hex(ord(x))
> '0x10040'
> >>> unicodedata.name(chr(0x10040))
> 'LINEAR B SYLLABLE B025 A2'
> >>> ord(chr(0x10040))
> 65600
> >>> print(ascii(chr(0x10040)))
> '\ud800\udc40'
>
> -Mark

I am still a long way from moving to Python 3, but I look forward
to what I hope will be more rational Unicode handling there.
Thanks for the info.
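
For anyone stuck on a narrow build in the meantime, the surrogate arithmetic that unichr() and ord() decline to do can be written by hand. What follows is only a rough sketch: the function names are my own (hypothetical), not anything from the standard library, and it works on plain integers so it should behave the same on Python 2 and 3. On a narrow 2.x build the two halves could presumably be joined with unichr(high) + unichr(low), and split again with ord(s[0]) and ord(s[1]).

```python
def to_surrogate_pair(code_point):
    """Split a code point outside the BMP into UTF-16 (high, low) surrogates.

    Hypothetical helper, not a standard-library function.
    """
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not in range U+10000..U+10FFFF")
    offset = code_point - 0x10000          # 20-bit value
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low


def from_surrogate_pair(high, low):
    """Combine UTF-16 (high, low) surrogates back into one code point.

    Hypothetical helper, the inverse of to_surrogate_pair().
    """
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


# The character from the original post, U+10040.
high, low = to_surrogate_pair(0x10040)
print(hex(high), hex(low))
print(from_surrogate_pair(high, low))
```

For U+10040 this gives back the same pair shown in the 2.5 session above, (0xd800, 0xdc40), and combining them returns 65600.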