On Thu, Jul 3, 2008 at 3:00 PM, Adam Olsen <[EMAIL PROTECTED]> wrote:
> On Thu, Jul 3, 2008 at 3:01 PM, Terry Reedy <[EMAIL PROTECTED]> wrote:
>>
>> The premise is the OP's idea that Python should switch to all UCS4 to create
>> a more pure ('ideal') situation or the idea that len(s) should count
>> codepoints (correct term?) for all builds as a matter of purity even though
>> on it would be time-costly on 16-bit builds as a matter of practicality.
>
> Wrong term - code units and code points are equivalent in UTF-16 and
> UTF-32.  What you're looking for is unicode scalar values.

I don't think so. I have in my lap the Unicode 5.0 standard, which on
page 102, under UTF-16, states (amongst others):

"""
* In UTF-16, the code point sequence <004D, 0430, 4E8C, 10302> is
  represented as <004D 0430 4E8C D800 DF02>, where <D800 DF02>
  corresponds to U+10302.
* Because surrogate code points are not Unicode scalar values,
  isolated UTF-16 code units in the range D800[16]..DFFF[16] are
  ill-formed.
"""

From this I understand they distinguish carefully between code points
and code units -- D800 is a code unit but not a code point, 10302 is a
code point but not a (UTF-16) code unit.

OTOH outside the context of UTF-16, the surrogates are also referred to
as "reserved code points" (e.g. in Table 2-3 on page 27, "Types of
Code Points").

I think the best thing we can do is to use "code points" to refer to
characters and "code units" to refer to the individual 16-bit values in
the UTF-16 encoding; this seems compatible with usage elsewhere in this
thread by most folks.

Also see http://unicode.org/glossary/:

"""
Code Point. Any value in the Unicode codespace; that is, the range of
integers from 0 to 10FFFF[16]. (See definition D10 in Section 3.4,
Characters and Encoding.)

. . .

Code Unit. The minimal bit combination that can represent a unit of
encoded text for processing or interchange. The Unicode Standard uses
8-bit code units in the UTF-8 encoding form, 16-bit code units in the
UTF-16 encoding form, and 32-bit code units in the UTF-32 encoding
form. (See definition D77 in Section 3.9, Unicode Encoding Forms.)
"""

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)
_______________________________________________
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
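
The code point / code unit distinction drawn above can be checked with a small sketch. It assumes a modern Python 3 (3.3 or later), where len() on a str counts code points regardless of build; the helper name utf16_code_units is made up for illustration and is not an existing API. It encodes the standard's example sequence <004D, 0430, 4E8C, 10302> and shows that four code points become five 16-bit code units, and that an isolated surrogate is rejected by the strict UTF-16 codec.

```python
# Code point sequence from the Unicode 5.0 example quoted above.
code_points = [0x004D, 0x0430, 0x4E8C, 0x10302]
text = "".join(chr(cp) for cp in code_points)


def utf16_code_units(s):
    """Return the 16-bit UTF-16 code units of s (hypothetical helper)."""
    data = s.encode("utf-16-be")  # big-endian, no byte order mark
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


units = utf16_code_units(text)

# Four code points, but five code units: U+10302 needs a surrogate pair.
print(len(text), len(units))
print(" ".join(f"{u:04X}" for u in units))  # 004D 0430 4E8C D800 DF02

# An isolated surrogate is a code point value but not a scalar value,
# so the strict UTF-16 codec refuses to encode it.
try:
    chr(0xD800).encode("utf-16-be")
    lone_surrogate_rejected = False
except UnicodeEncodeError:
    lone_surrogate_rejected = True
print(lone_surrogate_rejected)
```

On a 2008-era narrow (16-bit) build, len() would have reported code units rather than code points for the same string, which is the practical cost being debated in this thread.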