(Resending this to the list because I previously sent it only to Steven by mistake. Also showing off a case where top-posting is reasonable, since this bit requires no context. :-)
On Sat, Aug 18, 2012 at 1:41 AM, Ian Kelly <ian.g.ke...@gmail.com> wrote:
>
> On Aug 17, 2012 10:17 PM, "Steven D'Aprano"
> <steve+comp.lang.pyt...@pearwood.info> wrote:
>>
>> Unicode strings are not represented as Latin-1 internally. Latin-1 is a
>> byte encoding, not a unicode internal format. Perhaps you mean to say
>> that they are represented as a single byte format?
>
> They are represented as a single-byte format that happens to be equivalent
> to Latin-1, because Latin-1 is a proper subset of Unicode; every character
> representable in Latin-1 has a byte value equal to its Unicode codepoint.
> This talk of whether it's a byte encoding or a 1-byte Unicode representation
> is then just semantics. Even the PEP refers to the 1-byte representation as
> Latin-1.
>
>>
>> >> I understand the complaint
>> >> to be that while the change is great for strings that happen to fit in
>> >> Latin-1, it is less efficient than previous versions for strings that
>> >> do not.
>> >
>> > That's not the way I interpreted the PEP 393. It takes a pure unicode
>> > string, finds the largest code point in that string, and chooses 1, 2 or
>> > 4 bytes for every character, based on how many bits it'd take for that
>> > largest code point.
>>
>> That's how I interpret it too.
>
> I don't see how this is any different from what I described. Using all 4
> bytes of the code point, you get UCS-4. Truncating to 2 bytes, you get
> UCS-2. Truncating to 1 byte, you get Latin-1.
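
For anyone who wants to poke at this, here is a rough sketch of the two claims in the quoted text: that a Latin-1 byte value equals its Unicode code point, and that the width is picked from the largest code point in the string. The helper name pep393_width is made up for illustration and is not anything in CPython; it only mimics the 1/2/4 rule as described above, not the actual object layout.

```python
def pep393_width(s):
    """Bytes per character under the rule described above (hypothetical helper)."""
    largest = max(map(ord, s), default=0)
    if largest < 0x100:
        return 1   # every code point fits the 1-byte form (the Latin-1 range)
    if largest < 0x10000:
        return 2   # fits the 2-byte form (UCS-2)
    return 4       # needs the full 4-byte form (UCS-4)

# Every Latin-1 byte value equals the Unicode code point it encodes.
assert all(chr(cp).encode("latin-1") == bytes([cp]) for cp in range(256))

# Truncating the little-endian 4-byte form yields the narrower forms.
e_acute = "\u00e9"
ucs4 = e_acute.encode("utf-32-le")
assert ucs4[:2] == e_acute.encode("utf-16-le")
assert ucs4[:1] == e_acute.encode("latin-1")

widths = [pep393_width(s) for s in ("spam", "sp\u00e9m", "sp\u20acm", "sp\U0001F40Dm")]
print(widths)
```

A single character outside the Latin-1 range bumps the whole string to the next width, which is the cost the original complaint was about.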