On Sun, Aug 19, 2012 at 12:33 AM, Steven D'Aprano
<steve+comp.lang.pyt...@pearwood.info> wrote:
> On Sat, 18 Aug 2012 09:51:37 -0600, Ian Kelly wrote about PEP 393:
>> There is some additional benefit for Latin-1 users, but this has nothing
>> to do with Python. If Python is going to have the option of a 1-byte
>> representation (and as long as we have the flexible representation, I
>> can see no reason not to),
>
> The PEP explicitly states that it only uses a 1-byte format for ASCII
> strings, not Latin-1:

I think you misunderstand the PEP then, because that is empirically false.

Python 3.3.0b2 (v3.3.0b2:4972a8f1b2aa, Aug 12 2012, 15:23:35) [MSC v.1600 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.getsizeof(bytes(range(256)).decode('latin1'))
329

The constructed string contains all 256 Latin-1 characters, so if
Latin-1 strings must be stored in the 2-byte format, then the size
should be at least 512 bytes. It is not, so I think it must be using
the 1-byte encoding.

> "ASCII-only Unicode strings will again use only one byte per character"

This says nothing one way or the other about non-ASCII Latin-1 strings.

> "If the maximum character is less than 128, they use the PyASCIIObject
> structure"

Note that this only describes the structure of "compact" string
objects, which I have to admit I do not fully understand from the PEP.
The wording suggests that it only uses the PyASCIIObject structure,
not the derived structures. It then says that for compact ASCII
strings "the UTF-8 data, the UTF-8 length and the wstr length are the
same as the length of the ASCII data." But these fields are part of
the PyCompactUnicodeObject structure, not the base PyASCIIObject
structure, so they would not exist if only PyASCIIObject were used.
It would also imply that compact non-ASCII strings are stored
internally as UTF-8, which would be surprising.

> and:
>
> "The data and utf8 pointers point to the same memory if the string uses
> only ASCII characters (using only Latin-1 is not sufficient)."

This says that if the data are ASCII, then the 1-byte representation
and the utf8 pointer will share the same memory. It does not imply
that the 1-byte representation is not used for Latin-1, only that it
cannot also share memory with the utf8 pointer.
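
For anyone who wants to repeat the measurement without worrying about
the size of the object header (which differs between the ASCII and
non-ASCII compact structures), here is a rough sketch that differences
two string sizes so the fixed overhead cancels out. It assumes CPython
3.3 or later; the helper name bytes_per_char and the sample characters
are my own choices for illustration, not anything from the PEP.

```python
import sys

def bytes_per_char(ch, n=1000):
    """Estimate storage per character for strings made only of `ch`.

    Hypothetical helper: subtracting the sizes of two strings of
    different lengths cancels the fixed per-object header, leaving
    only the cost of the extra n characters.
    """
    return (sys.getsizeof(ch * (2 * n)) - sys.getsizeof(ch * n)) / n

# One sample character from each range the PEP distinguishes.
samples = [
    ("ASCII", "a"),             # code point < 128
    ("Latin-1", "\xe9"),        # code point < 256
    ("BMP", "\u20ac"),          # code point < 65536
    ("astral", "\U0001F600"),   # code point >= 65536
]

for label, ch in samples:
    print(label, bytes_per_char(ch))
```

If Latin-1 strings really do get the 1-byte representation, the ASCII
and Latin-1 rows should report the same figure, with the BMP and astral
rows reporting two and four times that.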