On Thu, May 12, 2011 at 2:42 PM, Terry Reedy <tjre...@udel.edu> wrote:
> On 5/12/2011 12:17 PM, Ian Kelly wrote:
>> Right. *Under the hood* Python uses UCS-2 (which is not exactly the
>> same thing as UTF-16, by the way) to represent Unicode strings.
>
> I know some people say that, but according to the definitions of the unicode
> consortium, that is wrong! The earlier UCS-2 *cannot* represent chars in the
> Supplementary Planes. The later (1996) UTF-16, which Python uses, can. The
> standard considers 'UCS-2' obsolete long ago. See
>
> https://secure.wikimedia.org/wikipedia/en/wiki/UTF-16/UCS-2
> or http://www.unicode.org/faq/basic_q.html#14

At the first link, in the section _Use in major operating systems and
environments_, it states, "The Python language environment officially
only uses UCS-2 internally since version 2.1, but the UTF-8 decoder to
"Unicode" produces correct UTF-16. Python can be compiled to use UCS-4
(UTF-32) but this is commonly only done on Unix systems."

PEP 100 says:

    The internal format for Unicode objects should use a Python
    specific fixed format <PythonUnicode> implemented as 'unsigned
    short' (or another unsigned numeric type having 16 bits). Byte
    order is platform dependent.

    This format will hold UTF-16 encodings of the corresponding
    Unicode ordinals. The Python Unicode implementation will address
    these values as if they were UCS-2 values. UCS-2 and UTF-16 are
    the same for all currently defined Unicode character points.
    UTF-16 without surrogates provides access to about 64k characters
    and covers all characters in the Basic Multilingual Plane (BMP) of
    Unicode.

    It is the Codec's responsibility to ensure that the data they pass
    to the Unicode object constructor respects this assumption. The
    constructor does not check the data for Unicode compliance or use
    of surrogates.

I'm getting out of my depth here, but that implies to me that while
Python stores UTF-16 and can correctly encode/decode it to UTF-8,
other codecs might only work correctly with UCS-2, and the unicode
class itself ignores surrogate pairs. I'm not sure how much this
might have changed since the original implementation, though,
especially for Python 3.
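
To make the UCS-2 versus UTF-16 distinction concrete, here is a minimal sketch that lists the 16-bit code units UTF-16 produces for one BMP character and one supplementary-plane character. It assumes a Python 3 interpreter; the helper names `utf16_code_units` and `is_surrogate` are made up for illustration and are not part of Python or of PEP 100. It looks only at the codec output, so it says nothing definite about how any given build stores strings internally.

```python
import struct

def utf16_code_units(text):
    """Return the 16-bit code units of text encoded as big-endian UTF-16.

    Hypothetical helper for illustration only.
    """
    data = text.encode("utf-16-be")
    return list(struct.unpack(">%dH" % (len(data) // 2), data))

def is_surrogate(unit):
    """True if a 16-bit code unit lies in the surrogate range D800-DFFF."""
    return 0xD800 <= unit <= 0xDFFF

bmp = "\u00e9"         # U+00E9, inside the Basic Multilingual Plane
astral = "\U0001D11E"  # U+1D11E MUSICAL SYMBOL G CLEF, a supplementary-plane character

bmp_units = utf16_code_units(bmp)        # one code unit, same value as in UCS-2
astral_units = utf16_code_units(astral)  # two code units: a surrogate pair

# Code that addresses these values "as if they were UCS-2" sees two items
# for the single supplementary-plane character.
print([hex(u) for u in bmp_units])
print([hex(u) for u in astral_units])

# The round trip through UTF-8 keeps the character intact.
round_tripped = astral.encode("utf-8").decode("utf-8")
```

If the concern above holds, the two surrogate code units are exactly what a 16-bit-storage build would index and count separately, which is why such a build would be expected to report a length of 2 for the one-character string `astral`.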