"Martin v. Löwis" writes: > The term "UCS-2" is a character set that can encode only encode 65536 > characters; it thus refers to Unicode 1.1. According to the Unicode > Consortium's FAQ, the term UCS-2 should be avoided these days.
So what do you propose we call the Python implementation? You can
call it "code-unit-oriented" if you like, but in fact it is identical
to UCS-2 for all non-hairsplitting purposes. AFAICS the Unicode
Consortium deprecates the *term* UCS-2 because they would like us to
avoid *implementations* that don't encode the full Unicode character
set, not because the term is technically incorrect.

Strictly speaking, internally Python only encodes 65536 characters in
2-octet builds. Its (Unicode) string-handling code does not know
about surrogates at all, AFAIK, and therefore is not UTF-16
conforming. (The anomalies discussed here are type transformations,
not string-handling, for my purpose.) I really don't see why we
shouldn't call a UCS-2 implementation by its name.

AFAIK this was not supposed to change in Python 3; indexing and
slicing go by code unit (isomorphic to UCS-n), not character, and due
to PEP 383 4-octet builds do not conform (internally) to UTF-32, and
can produce output that does not conform to Unicode at all (as a user
option, of course, but it's still non-conformant).

 > > IMO, we should go back to the Python2 terms UCS2 and UCS4 which
 > > are correct and provide a clear description of what Python uses
 > > internally for code units.
 >
 > No, we shouldn't. The term UCS-2 is deprecated, see above.

Too bad for the Unicode Consortium, I say. UCS-2 is the closest term
that folks who are not Unicode geeks will have a chance of
understanding.

I agree with Marc-Andre that "narrow" and "wide" are too ambiguous to
be useful. Many people will interpret them as "UTF-16" (or even
"UTF-8") and "UTF-32", respectively, which is dead wrong. Others
won't have a clue.

Using "UCS-2" and "UCS-4" has the correct connotations to Unicode
geeks, and they are easy to look up for non-geeks who care about
precise definitions. Cf.
the second half of the FAQ you quote:

    Instead, "UCS-2" has sometimes been used in the past to indicate
    that an implementation does not support supplementary characters
    and doesn't interpret pairs of surrogate code points as
    characters. Such an implementation would not handle processing
    like character properties, codepoint boundaries, collation,
    etc. for supplementary characters.

"Hey, Python, I'm looking at you!" (Strictly speaking, Python
libraries do some of that for us, but the Python *language* does not.)
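
For anyone who wants to see the code-unit point concretely, here is a
rough sketch, not a description of the interpreter's internals. It
assumes a current CPython (3.3 or later), where the 2-octet/4-octet
build distinction no longer exists, so it emulates what a 2-octet
build would store by encoding to UTF-16. The helper name
utf16_code_units is made up for illustration. The second half shows
the PEP 383 case: a string holding a lone surrogate that the strict
UTF-32 encoder refuses.

```python
def utf16_code_units(s):
    """Hypothetical helper: the 16-bit code units of s in UTF-16.

    This only emulates what a 2-octet build would have stored; it is
    an assumption for illustration, not an interpreter API.
    """
    data = s.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little")
            for i in range(0, len(data), 2)]


# One supplementary character (U+10000) between two BMP characters.
s = "a\U00010000b"
units = utf16_code_units(s)

# Indexing by code unit sees 4 items; index 1 is a high surrogate,
# not a character. Indexing by code point sees 3 items.
assert len(units) == 4
assert units[1] == 0xD800 and units[2] == 0xDC00
assert len(s) == 3

# PEP 383: an undecodable byte becomes a lone low surrogate.
escaped = b"\xff".decode("utf-8", "surrogateescape")
assert escaped == "\udcff"

# The strict UTF-32 encoder rejects it, so this string is not
# well-formed UTF-32 even though each item fits in 4 octets.
try:
    escaped.encode("utf-32")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# The original byte comes back only through the same error handler.
round_trip = escaped.encode("utf-8", "surrogateescape")
```

Slicing units[:2] cuts the surrogate pair in half, which is exactly
the behaviour the FAQ associates with the term "UCS-2".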