"Martin v. Löwis" <[EMAIL PROTECTED]> wrote: > Josiah Carlson schrieb: > > For me, having recently remembered what was in a unicode string, and > > verifying it by checking the source, the question in my mind is whether > > we want to stick with the same 2-representation implementation (default > > encoding and UTF-16 or UCS-4 depending on build), or go with more or > > fewer representations. > > I would personally like to see a Python API that operates on code > points, with support for 17 planes. I also think that efficient indexing > is important.
Fully-featured unicode would be nice. > There are trade-offs, of course. I personally think the best trade-off > would be to have a two-byte representation, along with a flag telling > whether there are any surrogate pairs in the string. Indexing and > length would be constant-time if there are no surrogates, and linear > time if there are. What about a tree structure over the top of the string as I described in another post? If there are no surrogate pairs, the pointer to the tree is null. If there are surrogate pairs, we could either use the structure as I described, or even modify it so that we get even better memory utilization/performance (choose tree nodes based on where surrogate pairs are, up to some limit). - Josiah _______________________________________________ Python-3000 mailing list [email protected] http://mail.python.org/mailman/listinfo/python-3000 Unsubscribe: http://mail.python.org/mailman/options/python-3000/archive%40mail-archive.com
