On 6/13/07, "Martin v. Löwis" <[EMAIL PROTECTED]> wrote:
> >> Until one or more of the senior developers says otherwise, I'm going
> >> to assume that.
> >
> > Yeah, what's the difference between code units and points?
>
> A code unit is the atomic base in some encoding. It is a single byte
> in most encodings, but a 16-bit quantity in UTF-16 (and a 32-bit
> quantity in UTF-32).
>
> A code point is something that has a 1:1 relationship with a logical
> character (in particular, a Unicode character).
>
> In UCS-2, a code point can be represented in 16 bits, and you can
> represent all BMP characters. The low and high surrogates don't
> encode characters and are reserved.
>
> In UCS-4, you need more than 16 bits to represent a code point.
> For example, you might use UTF-16, where you can use a single
> code unit for all BMP characters, and two of them for code points
> above U+FFFF.
>
> Ever since PEP 261, Python admits that the elements of a Unicode
> string are code units, and that you might need more than one of
> them (specifically, for non-BMP characters in a narrow build)
> to represent a code point.
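To make the distinction quoted above concrete, here is a small sketch that counts code points and code units for one BMP character and one character above U+FFFF. It assumes a current Python 3, where the elements of a str are code points; on a narrow build of the kind discussed in this thread, len() of the non-BMP string would report 2 instead. The helper names utf16_code_units and count_units are made up for illustration.

```python
# Illustrative sketch only; assumes modern Python 3 (str elements are code points).

def utf16_code_units(text):
    """Return the UTF-16 code units of text as a list of integers."""
    data = text.encode("utf-16-be")  # big-endian, no byte order mark
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


def count_units(text):
    """Count code points and the code units needed in three encodings."""
    return {
        "code_points": len(text),
        "utf8_units": len(text.encode("utf-8")),            # 8-bit units
        "utf16_units": len(utf16_code_units(text)),         # 16-bit units
        "utf32_units": len(text.encode("utf-32-be")) // 4,  # 32-bit units
    }


bmp = "\u00e9"         # U+00E9, inside the BMP
astral = "\U0001D11E"  # U+1D11E MUSICAL SYMBOL G CLEF, above U+FFFF

print(count_units(bmp))
print(count_units(astral))
# The two UTF-16 code units for U+1D11E are a high and a low surrogate.
print([hex(u) for u in utf16_code_units(astral)])
```

For the non-BMP character, the UTF-16 form is a pair of surrogates (two code units) standing for one code point, while UTF-8 encodes the same code point directly in four bytes.
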
Thanks for clearing that up. It sounds like we really use code units,
not code points (except when building with the 4-byte Unicode option,
when they are equivalent). Is there anywhere where we use code points,
apart from the UTF-8 codecs, which encode properly matched surrogate
pairs as a single code point?

Is it correct to say that a surrogate pair in UTF-16 is two code units
representing a single code point?

Apart from the surrogates, are there code points that aren't
characters? Are there characters that don't have a representation as a
single code point? (I know some characters have multiple
representations, some of which use multiple code points.)

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)
_______________________________________________
Python-3000 mailing list
Python-3000@python.org
http://mail.python.org/mailman/listinfo/python-3000