On 6/12/07, Rauli Ruohonen <[EMAIL PROTECTED]> wrote:
> On 6/12/07, Jim Jewett <[EMAIL PROTECTED]> wrote:
> > On 6/12/07, Rauli Ruohonen <[EMAIL PROTECTED]> wrote:
> > > Practically speaking, there's little need to interpret
> > > surrogate pairs as two code points instead of as one
> > > non-BMP code point.
> > Depends on your definition of "practically".
> > Python does interpret them that way to maintain O(1) positional
> > access within strings encoded with 16 bits/char.
> Indexing does not try to interpret the string as code points at all, it
> works on code units.

Even assuming that (when most people will assume "letters", and could
maybe understand that accent marks sometimes count), it still doesn't
quite work. Slicing (or iterating over) a string claims to return
strings of the same type.

>>> for x in u"abc": print type(x)
...
<type 'unicode'>
<type 'unicode'>
<type 'unicode'>

Strictly speaking, the two halves of a surrogate pair should be returned
together, rather than as separate code units, since a lone surrogate is
not a valid character on its own.

It probably won't be fixed, since those who care most are probably
using 4-byte Unicode characters.

-jJ
_______________________________________________
Python-3000 mailing list
http://mail.python.org/mailman/listinfo/python-3000
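
For illustration only, here is a rough sketch of the code unit vs. code point distinction discussed above. It is written in Python 3 syntax rather than the Python 2 used in the thread, and it builds the UTF-16 code units explicitly instead of relying on how any particular build stores strings. The helper names utf16_code_units and group_surrogate_pairs are made up for this example and are not part of any Python API.

```python
import struct


def utf16_code_units(text):
    """Return the UTF-16 code units of text as a list of ints."""
    data = text.encode("utf-16-le")
    # Each code unit is 2 bytes, little-endian, unsigned.
    return list(struct.unpack("<%dH" % (len(data) // 2), data))


def group_surrogate_pairs(units):
    """Combine each high/low surrogate pair into one code point.

    Unpaired surrogates are passed through unchanged.
    """
    code_points = []
    i = 0
    while i < len(units):
        unit = units[i]
        is_high = 0xD800 <= unit <= 0xDBFF
        has_low = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        if is_high and has_low:
            low = units[i + 1]
            code_points.append(0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00))
            i += 2
        else:
            code_points.append(unit)
            i += 1
    return code_points


# 'a', U+1D11E MUSICAL SYMBOL G CLEF (a non-BMP character), 'b'
sample = "a\U0001D11Eb"
units = utf16_code_units(sample)
points = group_surrogate_pairs(units)

print([hex(u) for u in units])   # four code units: the clef takes two
print([hex(p) for p in points])  # three code points
```

Indexing by code unit, as with 16 bits/char storage, sees four items here and can land on either half of the pair. Grouping the pair first yields three items, but finding the n-th one then requires a scan from the start, which is the O(1) trade-off mentioned in the quoted text.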
