On 22.11.2010 11:48, Stephen J. Turnbull wrote:
> Raymond Hettinger writes:
>
> > Neither UTF-16 nor UCS-2 is exactly correct anyway.
>
> From a standards lawyer point of view, UCS-2 is exactly correct, as
> far as I can tell upon rereading ISO 10646-1, especially Annexes H
> ("retransmitting devices") and Q ("UTF-16"). Annex Q makes it clear
> that UTF-16 was intentionally designed so that Python-style processing
> could be done in a UCS-2 context.
I could only find the FCD of 10646:2010, where annex H was integrated
into section 10:
http://www.itscj.ipsj.or.jp/sc2/open/02n4125/FCD10646-Main.pdf
There they have stopped using the term UCS-2, and added a note:
# NOTE – Former editions of this standard included references to a
# two-octet BMP form called UCS-2 which would be a subset
# of the UTF-16 encoding form restricted to the BMP UCS scalar values.
# The UCS-2 form is deprecated.
I think they are now acknowledging that UCS-2 was a misleading term,
because it was ambiguous whether it refers to a coded character set
(CCS), a character encoding form (CEF), or a character encoding scheme
(CES); like "ASCII", people have been using it for all three of them.
Apparently, the ISO WG interprets earlier revisions as saying that
UCS-2 is a CEF that restricts UTF-16 to the BMP. THIS IS NOT WHAT
PYTHON DOES. In a narrow Python build, the character set is *not*
restricted to the BMP. Instead, Unicode strings are meant to be
interpreted (by applications) as UTF-16.
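
To illustrate what "interpreted as UTF-16" means, here is a minimal
sketch. It does not depend on the build width: it obtains the 16-bit
code units by encoding to UTF-16 explicitly, which is my assumption
about how best to show the effect. The helper names utf16_code_units
and to_surrogate_pair are made up for this example.

```python
import struct

def utf16_code_units(s):
    """Return the 16-bit code units of s in the UTF-16 encoding form."""
    data = s.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(data) // 2), data))

def to_surrogate_pair(cp):
    """Split a scalar value outside the BMP into (high, low) surrogates."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("scalar value is not in a supplementary plane")
    v = cp - 0x10000
    return 0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)

# 'a' followed by U+1D11E MUSICAL SYMBOL G CLEF, which lies outside the BMP
s = "a\U0001D11E"
units = utf16_code_units(s)

# Two scalar values, but three code units: the second character
# occupies a surrogate pair.
print([hex(u) for u in units])
print(len(units))
```

A string type that indexes by 16-bit code unit would report a length
of 3 here and could slice between the two surrogates, yet the
character set is clearly not restricted to the BMP.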
> > For the "wide" build, the entire range of unicode is encoded at
> > 4 bytes per character and slicing/len operate correctly since
> > every character is the same length. This used to be called UCS-4
> > and is now UTF-32.
>
> That's inaccurate, I believe. UCS-4 is not a UTF, and doesn't satisfy
> the range restrictions of a UTF.
Not sure what it says in your copy; in mine, section 9.3 says
# 9.3 UTF-32 (UCS-4)
# UTF-32 (or UCS-4) is the UCS encoding form that assigns each UCS
# scalar value to a single unsigned 32-bit code unit. The terms UTF-32
# and UCS-4 can be used interchangeably to designate this encoding
# form.
so they (now) view the two as synonyms.
I think that when ISO 10646 started, they were also fairly confused
about these issues (as the group/plane/row/cell structure demonstrates,
IMO). This is not surprising, since the notion of byte-based character
sets had been ingrained for so long. It took 20 years to learn that
a UCS scalar value really is *not* a sequence of bytes, but a natural
number.
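
A small sketch of that distinction, assuming nothing beyond the
standard codecs: the scalar value is a single integer, and the bytes
only appear once a particular encoding scheme (including a byte
order) has been picked. The variable names are just for illustration.

```python
ch = "\u20ac"          # EURO SIGN
scalar = ord(ch)       # the scalar value: a natural number, not bytes

# The same scalar value serialized under several encoding schemes.
schemes = ["utf-8", "utf-16-be", "utf-16-le", "utf-32-be", "utf-32-le"]
serialized = {name: ch.encode(name) for name in schemes}

print(hex(scalar))
for name in schemes:
    print(name, serialized[name].hex())
```

One number, five different byte sequences; none of them "is" the
character.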
> However, I don't see how "narrow" tells us more than "UCS-2" does. If
> "UCS-2" is equally (or more) informative, I prefer it because it is
> the technically precise, already well-defined, term.
But it's not. It is a confusing term, one that the relevant standards
bodies are abandoning. After reading FCD 10646:2010, I could agree to
call the two implementations UTF-16 and UTF-32 (as these terms
designate CEFs). Unfortunately, they also designate CESs.
> If we have to document what the terms we choose mean anyway, why not
> document the existing terms and reduce entropy, rather than invent new
> ones and increase entropy?
Because the proposed existing term is deprecated.
Regards,
Martin