Re: [Python-3000] String comparison

Rauli Ruohonen Tue, 12 Jun 2007 08:23:11 -0700

On 6/12/07, Jim Jewett <[EMAIL PROTECTED]> wrote:
> On 6/12/07, Rauli Ruohonen <[EMAIL PROTECTED]> wrote:
> > Practically speaking, there's little need to interpret surrogate pairs
> > as two code points instead of as one non-BMP code point.
>
> Depends on your definition of "practically".
>
> Python does interpret them that way to maintain O(1) positional access
> within strings encoded with 16 bits/char.


Indexing does not try to interpret the string as code points at all; it
works on code units. The difference is easier to see if you imagine Python
using UTF-8 for strings: indexing would still work on (8-bit) code units,
not on code points. It is higher-level operations such as
unicodedata.normalize() that need to interpret strings as code points.
For 16-bit code units there are two interpretations, depending on whether
you think that a surrogate pair means one code point (UTF-16) or two (UCS-2).
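
A minimal sketch of the code unit / code point distinction. It assumes a
modern Python 3 (3.3 or later), where str itself indexes by code point, so
the code units are simulated by encoding first. The helper names
utf16_code_units and count_code_points are made up for illustration only.

```python
def utf16_code_units(text):
    """Return the 16-bit code units of text's UTF-16 encoding."""
    raw = text.encode("utf-16-le")
    return [int.from_bytes(raw[i:i + 2], "little")
            for i in range(0, len(raw), 2)]


def count_code_points(units, pair_surrogates):
    """Count code points in a list of 16-bit code units.

    pair_surrogates=True is the UTF-16 reading (a high surrogate followed
    by a low surrogate is one code point); False is the UCS-2 reading
    (every code unit is its own code point).
    """
    count = 0
    i = 0
    while i < len(units):
        is_pair = (pair_surrogates
                   and 0xD800 <= units[i] <= 0xDBFF
                   and i + 1 < len(units)
                   and 0xDC00 <= units[i + 1] <= 0xDFFF)
        i += 2 if is_pair else 1
        count += 1
    return count


s = "a\U00010000b"                 # one non-BMP character between two ASCII ones
units16 = utf16_code_units(s)      # what a 16-bit build would index
units8 = list(s.encode("utf-8"))   # what a hypothetical UTF-8 build would index

print([hex(u) for u in units16])
print(len(units8))
print(count_code_points(units16, pair_surrogates=False))  # UCS-2 reading
print(count_code_points(units16, pair_surrogates=True))   # UTF-16 reading
```

Indexing units16 or units8 is O(1) and never looks at surrogates or lead
bytes; only count_code_points has to pick an interpretation.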

Incidentally, unicodedata.normalize() is an example of a function that
currently interprets its input as UCS-2 instead of UTF-16. If you pass it a
surrogate pair, it treats the pair as two code points, and so won't do any
normalization for anything outside the BMP on a UCS-2 build. Another example would be
ord(), which gives you TypeError if you pass it a surrogate pair (oddly
enough, as strings of different lengths are of the same type).
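
A rough way to see the same effect without a UCS-2 build, assuming Python 3.3
or later: write the character once as a single non-BMP code point and once
as two separate surrogate code points, which is how a UCS-2 reading sees it.
The choice of U+1D15E (MUSICAL SYMBOL HALF NOTE, which has a canonical
decomposition) and the variable names are mine, not from the thread.

```python
import unicodedata

half_note = "\U0001D15E"    # one non-BMP code point
as_units = "\ud834\udd5e"   # the same character as two surrogate code points

# Read as one code point, the character decomposes under NFD.
decomposed = unicodedata.normalize("NFD", half_note)

# Read as two code points, nothing decomposes: surrogates have no mapping.
untouched = unicodedata.normalize("NFD", as_units)

# A function that expects exactly one character rejects the two-element
# string with TypeError, even though it is still just a str.
try:
    ord(as_units)
    ord_error = None
except TypeError as exc:
    ord_error = exc

print([hex(ord(c)) for c in decomposed])
print(untouched == as_units)
print(type(ord_error).__name__)
```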
