On Sat, 18 Aug 2012 09:51:37 -0600, Ian Kelly wrote about PEP 393:

> The change does not just benefit ASCII users. It primarily benefits
> anybody using a wide unicode build with strings mostly containing only
> BMP characters.

Just to be clear:

If you have many strings which are *mostly* BMP, but have one or two
non-BMP characters in *each* string, you will see no benefit.

But if you have many strings which are all BMP, and only a few strings
containing non-BMP characters, then you will see a big benefit.

> Even for narrow build users, there is the benefit that
> with approximately the same amount of memory usage in most cases, they
> no longer have to worry about non-BMP characters sneaking in and
> breaking their code.

Yes! +1000 on that.

> There is some additional benefit for Latin-1 users, but this has nothing
> to do with Python. If Python is going to have the option of a 1-byte
> representation (and as long as we have the flexible representation, I
> can see no reason not to),

The PEP explicitly states that it only uses a 1-byte format for ASCII
strings, not Latin-1:

    "ASCII-only Unicode strings will again use only one byte per
    character"

and later:

    "If the maximum character is less than 128, they use the
    PyASCIIObject structure"

and:

    "The data and utf8 pointers point to the same memory if the string
    uses only ASCII characters (using only Latin-1 is not sufficient)."

> then it is going to be Latin-1 by definition,

Certainly not, either in fact or in principle. There are a large number
of 1-byte encodings; Latin-1 is hardly the only one.

> because that's what 1-byte Unicode (UCS-1, if you will) is. If you have
> an issue with that, take it up with the designers of Unicode.

The designers of Unicode have never created a standard "1-byte Unicode"
or UCS-1, as far as I can determine.

The Unicode standard covers over a million code points, far too many to
fit in a single byte. There is some historical justification for using
"Unicode" to mean UCS-2, but with the standard being extended beyond the
BMP, that is no longer valid.

See http://www.cl.cam.ac.uk/~mgk25/unicode.html for more details.

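As a rough illustration of the two points above (a sketch only, not
taken from the PEP; the byte value 0xA4 and the three codec names are
picked purely for the example), the same single byte decodes to a
different character under each 1-byte encoding, and the Unicode code
space is far larger than one byte can index:

```python
# One byte value decoded with three different single-byte encodings.
# The byte 0xA4 and the choice of codecs are arbitrary, for illustration.
raw = b"\xa4"
decoded = {name: raw.decode(name) for name in ("latin-1", "iso8859-15", "iso8859-5")}
for name, ch in decoded.items():
    print(f"{name:>10}: U+{ord(ch):04X}")

# The Unicode code space runs from U+0000 to U+10FFFF inclusive.
code_points = 0x10FFFF + 1
one_byte_values = 2 ** 8
print(code_points, "code points versus", one_byte_values, "byte values")
```

Which character a byte stands for depends entirely on the encoding
chosen, so "one byte per character" does not by itself pick out Latin-1.
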
I think what you are trying to say is that the Unicode designers
deliberately matched the Latin-1 standard for Unicode's first 256 code
points. That's not the same thing though: there is no Unicode standard
mapping to a single-byte format.

-- 
Steven