Re: [Python-Dev] len(chr(i)) = 2?

Martin v. Löwis Mon, 22 Nov 2010 03:45:42 -0800

On 22.11.2010 11:48, Stephen J. Turnbull wrote:
> Raymond Hettinger writes:
> 
>  > Neither UTF-16 nor UCS-2 is exactly correct anyway.
> 
> From a standards lawyer point of view, UCS-2 is exactly correct, as
> far as I can tell upon rereading ISO 10646-1, especially Annexes H
> ("retransmitting devices") and Q ("UTF-16").  Annex Q makes it clear
> that UTF-16 was intentionally designed so that Python-style processing
> could be done in a UCS-2 context.


I could only find the FCD of 10646:2010, where annex H was integrated
into section 10:

http://www.itscj.ipsj.or.jp/sc2/open/02n4125/FCD10646-Main.pdf

There they have stopped using the term UCS-2, and added a note

# NOTE – Former editions of this standard included references to a
# two-octet BMP form called UCS-2 which would be a subset
# of the UTF-16 encoding form restricted to the BMP UCS scalar values.
# The UCS-2 form is deprecated.

I think they are now acknowledging that UCS-2 was a misleading term,
because it was ambiguous whether it referred to a CCS (coded character
set), a CEF (character encoding form), or a CES (character encoding
scheme); as with "ASCII", people have been using it for all three.

Apparently, the ISO WG interprets earlier revisions as saying that
UCS-2 is a CEF that restricted UTF-16 to the BMP. THIS IS NOT WHAT
PYTHON DOES. In a narrow Python build, the character set is *not*
restricted to the BMP. Instead, Unicode strings are meant to be
interpreted (by applications) as UTF-16.
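
As a minimal sketch of what "interpreted as UTF-16" means: the snippet
below assumes a current Python 3, where str is indexed by code point, so
it imitates a narrow build by counting UTF-16 code units explicitly. The
helper name utf16_code_units is hypothetical, chosen for illustration.

```python
def utf16_code_units(s):
    """Return the UTF-16 code units of s as a list of integers.

    Hypothetical helper: it mimics what a narrow build stores internally
    by encoding to UTF-16 and splitting the result into 16-bit units.
    """
    data = s.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little")
            for i in range(0, len(data), 2)]


bmp_char = "\u20ac"          # EURO SIGN, inside the BMP
non_bmp_char = "\U00010000"  # first code point outside the BMP

# A BMP character occupies one code unit, a non-BMP character two
# (a surrogate pair), which is why len(chr(i)) can be 2 on a narrow build.
print(len(utf16_code_units(bmp_char)))
print([hex(u) for u in utf16_code_units(non_bmp_char)])
```

The second character is representable at all only because the code
units are read as UTF-16; a form truly restricted to the BMP would have
to reject it.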

>  > For the "wide" build, the entire range of unicode is encoded at
>  > 4 bytes per character and slicing/len operate correctly since
>  > every character is the same length.   This used to be called UCS-4
>  > and is now UTF-32.
> 
> That's inaccurate, I believe.  UCS-4 is not a UTF, and doesn't satisfy
> the range restrictions of a UTF.

Not sure what it says in your copy; in mine, section 9.3 says

# 9.3 UTF-32 (UCS-4)
# UTF-32 (or UCS-4) is the UCS encoding form that assigns each UCS
# scalar value to a single unsigned 32-bit code unit. The terms UTF-32
# and UCS-4 can be used interchangeably to designate this encoding
# form.

so they (now) view the two as synonyms.
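
A small sketch of that definition, assuming a current Python 3 (the
helper name utf32_code_units is made up for this example): every scalar
value maps to exactly one 32-bit code unit, and the code unit is
numerically equal to the scalar value.

```python
def utf32_code_units(s):
    """Return the UTF-32 code units of s as a list of integers.

    Hypothetical helper for illustration; it encodes to UTF-32 and
    splits the result into 32-bit units.
    """
    data = s.encode("utf-32-le")
    return [int.from_bytes(data[i:i + 4], "little")
            for i in range(0, len(data), 4)]


text = "a\u00e9\U0001d11e"  # ASCII, Latin-1 range, and a non-BMP character
units = utf32_code_units(text)

# One code unit per scalar value, equal to the scalar value itself.
print(len(units) == len(text))
print(units == [ord(c) for c in text])
```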

I think that when ISO 10646 started, they were also fairly confused
about these issues (as the group/plane/row/cell structure demonstrates,
IMO). This is not surprising, since the notion of byte-based character
sets had been ingrained for so long. It took 20 years to learn that
a UCS scalar value really is *not* a sequence of bytes, but a natural
number.
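
To make the distinction concrete, here is a short sketch (current
Python 3 assumed; the variable names are illustrative): the scalar value
is a single integer, and each byte sequence below is just one possible
serialization of it.

```python
ch = "\u20ac"  # EURO SIGN

# The scalar value is a natural number ...
scalar = ord(ch)
print(hex(scalar))

# ... while the bytes depend entirely on the chosen encoding.
serializations = {
    name: ch.encode(name)
    for name in ("utf-8", "utf-16-be", "utf-32-be")
}
for name, data in serializations.items():
    print(name, data.hex())
```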

> However, I don't see how "narrow" tells us more than "UCS-2" does.  If
> "UCS-2" is equally (or more) informative, I prefer it because it is
> the technically precise, already well-defined, term.

But it's not. It is a confusing term, one that the relevant standards
bodies are abandoning. After reading FCD 10646:2010, I could agree to
call the two implementations UTF-16 and UTF-32 (as these terms
designate CEFs). Unfortunately, they also designate CESs.

> If we have to document what the terms we choose mean anyway, why not
> document the existing terms and reduce entropy, rather than invent new
> ones and increase entropy?

Because the proposed existing term is deprecated.

Regards,
Martin