On 6/12/07, Jim Jewett <[EMAIL PROTECTED]> wrote:
> On 6/12/07, Rauli Ruohonen <[EMAIL PROTECTED]> wrote:
> > Practically speaking, there's little need to interpret surrogate pairs
> > as two code points instead of as one non-BMP code point.
>
> Depends on your definition of "practically".
>
> Python does interpret them that way to maintain O(1) positional access
> within strings encoded with 16 bits/char.

Indexing does not try to interpret the string as code points at all; it
works on code units. The difference is easier to see if you imagine Python
using UTF-8 for strings: indexing would still work on (8-bit) code units
instead of code points. It is higher-level operations such as
unicodedata.normalize() that need to interpret strings as code points. For
16-bit code units there are two interpretations, depending on whether you
think that a surrogate pair means one code point (UTF-16) or two (UCS-2).

Incidentally, unicodedata.normalize() is an example that currently does
interpret its input as UCS-2 instead of UTF-16. If you pass it a surrogate
pair, it treats the pair as two code points, and so on a UCS-2 build it
won't do any normalization for anything outside the BMP. Another example
would be unichr(), which gives you TypeError if you pass it a surrogate pair
(oddly enough, as strings of different lengths are of the same type).
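
To make the code unit / code point distinction concrete, here is a minimal sketch. It assumes a Python where str indexes by code point (3.3 or later), so the 16-bit and 8-bit code units are simulated by encoding explicitly rather than taken from the string's internal storage. The helper names utf16_code_units and combine_surrogates are made up for this illustration; they are not part of the standard library.

```python
import struct


def utf16_code_units(text):
    """Return the UTF-16 code units of text as a list of ints (hypothetical helper)."""
    raw = text.encode("utf-16-le")  # little-endian, no byte order mark
    return list(struct.unpack("<%dH" % (len(raw) // 2), raw))


def combine_surrogates(high, low):
    """Combine a surrogate pair into the one code point it encodes (UTF-16 reading)."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a high/low surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


text = "a\U0001D11Eb"               # 'a', one non-BMP character, 'b'

units16 = utf16_code_units(text)    # what a 16 bits/char build would index
units8 = list(text.encode("utf-8")) # what a UTF-8 based build would index

print(len(text))                    # 3 code points
print(len(units16))                 # 4 sixteen-bit code units
print(len(units8))                  # 6 eight-bit code units

# Indexing code units is O(1), but it can land inside a character:
print(hex(units16[1]))              # 0xd834, a lone high surrogate

# The UCS-2 reading stops there and sees two code points;
# the UTF-16 reading combines the pair into one.
print(hex(combine_surrogates(units16[1], units16[2])))  # 0x1d11e
```

The index into units16 is constant time in both readings. The only difference is whether a higher-level operation then combines the pair before interpreting it, which is the choice normalize() would have to make.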