>> Until one or more of the senior developers says otherwise, I'm going
>> to assume that.
>
> Yeah, what's the difference between code units and points?

A code unit is the atomic unit of an encoding. It is a single byte in most encodings, but a 16-bit quantity in UTF-16 and a 32-bit quantity in UTF-32.

A code point is something that has a 1:1 relationship with a logical character (in particular, a Unicode character).

In UCS-2, a code point can be represented in 16 bits, and you can represent all BMP characters. The low and high surrogates don't encode characters and are reserved.

In UCS-4, a code point can need more than 16 bits. For example, you might use UTF-16, where a single code unit is enough for any BMP character, and two of them (a surrogate pair) are needed for each code point above U+FFFF.

Ever since PEP 261, Python admits that the elements of a Unicode string are code units, and that you might need more than one of them (specifically, for non-BMP characters in a narrow build) to represent a code point.

Regards,
Martin
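
To make the distinction concrete, here is a minimal sketch that lists the code points of a string next to its UTF-16 code units. It assumes Python 3.3 or later, where a str is always indexed by code point and the narrow/wide build distinction no longer exists; on the narrow builds discussed above, the string elements themselves would be the UTF-16 code units. The helper names utf16_code_units and code_points are made up for this illustration and are not part of any library.

```python
import struct


def utf16_code_units(text):
    """Return the UTF-16 code units of text as a list of integers."""
    data = text.encode("utf-16-le")  # little-endian, no byte order mark
    # Each UTF-16 code unit is one unsigned 16-bit quantity.
    return list(struct.unpack("<%dH" % (len(data) // 2), data))


def code_points(text):
    """Return the code points of text (assumes Python 3.3+ str semantics)."""
    return [ord(ch) for ch in text]


euro = "\u20ac"      # EURO SIGN, a BMP character
clef = "\U0001D11E"  # MUSICAL SYMBOL G CLEF, a code point above U+FFFF

for label, s in (("euro", euro), ("clef", clef)):
    print(label,
          [hex(cp) for cp in code_points(s)],
          [hex(cu) for cu in utf16_code_units(s)],
          len(s.encode("utf-8")), "UTF-8 code units (bytes)")
```

The BMP character comes out as one code point and one 16-bit code unit, while the non-BMP character is still one code point but takes two UTF-16 code units (a high and a low surrogate) and four UTF-8 code units.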