What he said. IOW, we're treating each half of a surrogate as a "character", at least for purposes of counting items in a string. (Otherwise operations like len() and indexing/slicing would no longer be O(1).)
--Guido

On 6/2/07, Josiah Carlson <[EMAIL PROTECTED]> wrote:
>
> "Alexandre Vassalotti" <[EMAIL PROTECTED]> wrote:
> > Hi,
> >
> > I was doing some testing on the new _string_io module, since I was
> > slightly skeptical on my handling of wide Unicode characters (32-bit
> > of length, instead of the usual 16-bit in UTF-16). So, I ran this
> > little test:
> >
> > >>> s = _string_io.StringIO()
> > >>> s.write(u'晉')
> > >>> s.tell()
> > 2
> >
> > Like I expected, wide Unicode characters count for two. However, I was
> > surprised that Python treats them as two characters as well:
> >
> > >>> len(u'晉')
> > 2
> > >>> u'晉'
> > u'\ud87e\udccd'
> >
> > Is it a bug, or only an implementation choice?
>
> If your Python is compiled as a UTF-16 build, then any character in the
> extended plane will be seen as two characters by Python. If you are
> using a UCS-4 build (it's the same as UTF-32), then you should be seeing
> the single wide character as a single wide character. The only
> exception to this rule is if you enter the wide character as a surrogate
> pair, in which case Python doesn't normalize it into the single wide
> character. To get a real wide character, you would need to use a proper
> escape, or decode from an encoded string.
>
>
> - Josiah
>
> _______________________________________________
> Python-3000 mailing list
> [email protected]
> http://mail.python.org/mailman/listinfo/python-3000
> Unsubscribe:
> http://mail.python.org/mailman/options/python-3000/guido%40python.org
>

--
--Guido van Rossum (home page: http://www.python.org/~guido/)
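
For readers who want to check the numbers in the session quoted above, here is a minimal sketch of the surrogate arithmetic. It assumes Python 3.3 or later, where the narrow/wide build distinction no longer exists and len() counts code points, so the narrow-build count is approximated by counting UTF-16 code units. The helper names to_surrogate_pair and utf16_length are made up for this illustration and are not part of any module discussed in the thread. The character is written as the escape \U0002F8CD, which is the code point that corresponds to the pair u'\ud87e\udccd' shown above.

```python
def to_surrogate_pair(code_point):
    """Split a supplementary-plane code point into its UTF-16 surrogates.

    Hypothetical helper for illustration only.
    """
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary-plane code point")
    offset = code_point - 0x10000          # 20-bit value
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low


def utf16_length(text):
    """Count UTF-16 code units, which is what a narrow build counted.

    Hypothetical helper for illustration only.
    """
    return len(text.encode("utf-16-le")) // 2


wide = "\U0002F8CD"
high, low = to_surrogate_pair(ord(wide))
print(hex(high), hex(low))   # 0xd87e 0xdccd
print(utf16_length(wide))    # 2
```

Each surrogate is one fixed-size 16-bit unit, which is why counting and indexing by unit stays O(1), at the cost of len() reporting 2 for a single supplementary-plane character.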
