On Thu, 28 Mar 2013 10:11:59 -0600, Ian Kelly wrote:

> On Thu, Mar 28, 2013 at 8:38 AM, Chris Angelico <ros...@gmail.com>
> wrote:
>> PEP393 strings have two optimizations, or kinda three:
>>
>> 1a) ASCII-only strings
>> 1b) Latin1-only strings
>> 2) BMP-only strings
>> 3) Everything else
>>
>> Options 1a and 1b are almost identical - I'm not sure what the detail
>> is, but there's something flagging those strings that fit inside seven
>> bits. (Something to do with optimizing encodings later?) Both are
>> optimized down to a single byte per character.
>
> The only difference for ASCII-only strings is that they are kept in a
> struct with a smaller header. The smaller header omits the utf8 pointer
> (which optionally points to an additional UTF-8 representation of the
> string) and its associated length variable. These are not needed for
> ASCII-only strings because an ASCII string can be directly interpreted
> as a UTF-8 string for the same result. The smaller header also omits
> the "wstr_length" field which, according to the PEP, "differs from
> length only if there are surrogate pairs in the representation." For an
> ASCII string, of course there would not be any surrogate pairs.
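For anyone who wants to poke at the storage classes described above, here is a rough sketch, assuming CPython 3.3 or later (other implementations may store strings differently). The helper name per_char_bytes and the sample characters are my own choices for illustration, not anything taken from the PEP. It estimates bytes per character by comparing sys.getsizeof for two lengths of the same repeated character, so the fixed header cancels out.

```python
import sys

def per_char_bytes(ch, n=1000):
    """Estimate bytes stored per character (hypothetical helper).

    Subtracting the sizes of two strings built from the same character
    cancels the fixed header, leaving only the per-character cost.
    """
    small = ch * n
    large = ch * (2 * n)
    return (sys.getsizeof(large) - sys.getsizeof(small)) / n

# One sample character per storage class from the list above.
samples = {
    "ASCII": "a",             # fits in seven bits
    "Latin-1": "\xe9",        # fits in eight bits
    "BMP": "\u20ac",          # fits in sixteen bits
    "non-BMP": "\U0001F600",  # needs more than sixteen bits
}

widths = {label: per_char_bytes(ch) for label, ch in samples.items()}
for label, width in widths.items():
    print(label, width)

# Same length, same one-byte-per-character storage, so any size
# difference here comes from the header alone.
header_gap = sys.getsizeof("\xe9" * 10) - sys.getsizeof("a" * 10)
print("extra header bytes for Latin-1 over ASCII:", header_gap)

# A non-BMP character is one code point in the str itself, but takes
# two 16-bit code units (a surrogate pair) once encoded as UTF-16.
astral = "\U0001F600"
utf16_units = len(astral.encode("utf-16-le")) // 2
print(len(astral), utf16_units)
```

On CPython 3.3 or later I would expect the per-character widths to come out as 1, 1, 2 and 4. The size of the header gap varies between versions, so I have not quoted a figure for it.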
I wonder why they need to care about surrogate pairs?

ASCII and Latin-1 strings obviously do not have them. Nor do BMP-only strings. It's only strings in the SMPs that could need surrogate pairs, and they don't need them in Python's implementation since it's a full 32-bit implementation. So where do the surrogate pairs come into this?

I also wonder why the implementation bothers keeping a UTF-8 representation. That sounds like premature optimization to me. Surely you only need it when writing to a file with UTF-8 encoding? For most strings, that will never happen.

--
Steven
--
http://mail.python.org/mailman/listinfo/python-list