> Thanks for clearing that up. It sounds like we really use code units,
> not code points (except when building with the 4-byte Unicode option,
> when they are equivalent). Is there anywhere where we use code points,
> apart from the UTF-8 codecs, which encode properly matched surrogate
> pairs as a single code point?

The literal syntax also supports it: \U00010000 is supported even in a
narrow build, and gets transparently encoded to the corresponding two
code units; likewise for repr(). There is an SF patch to make
unicodedata.lookup support them as well.

> Is it correct to say that a surrogate in UCS-16 is two code units
> representing a single code point?

That's my understanding, yes.

> Apart from the surrogates, are there code points that aren't
> characters? Are there characters that don't have a representation as a
> single code point? (I know some characters have multiple
> representations, some of which use multiple code points.)

[assuming you mean "code unit" again]
Not in the Unicode type, no. In the byte string type, this happens all
the time with multi-byte encodings.

[assuming you really mean "code point" in the first question]
There are numerous unassigned code points in Unicode, i.e. they don't
represent a character *yet*. There are also several code points that
are "noncharacters", in particular U+FFFE and U+FFFF. These are
permanently reserved and should never be interpreted as abstract
characters (rule C5). FFFE is reserved because it is the byte-swapped
form of the BOM (U+FEFF); I believe FFFF is reserved so that APIs can
use -1 as an error value. (FWIW, U+FFFD *is* assigned and means
"REPLACEMENT CHARACTER", �).

As for "combining characters": I think the Unicode terminology really
is that they are separate characters. They get combined into a single
grapheme, and different character sequences might be considered
equivalent under the canonical normalization forms - but the decomposed
ö (o + combining diaeresis) actually is understood as a two-character
(i.e. two-codepoint) sequence. Whether that matches the intuitive
definition of "character", I don't know - and I'm sure somebody will
correct me if I presented it incorrectly.

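For anyone who wants to poke at these distinctions, here is a minimal
sketch. It assumes a current Python 3 interpreter (3.3 or later), where
str is indexed by code point, so the two code units of a narrow build
are only emulated here by encoding to UTF-16. The helper name
utf16_code_units is made up for this illustration.

```python
import unicodedata


def utf16_code_units(s):
    """Return the UTF-16 code units of s as a list of ints (hypothetical helper)."""
    data = s.encode("utf-16-be")  # big-endian, no BOM
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


# One code point outside the BMP, written with the \U literal syntax.
ch = "\U00010000"
units = utf16_code_units(ch)  # the surrogate pair a narrow build would store

# U+FFFE is a noncharacter; U+FFFD is an assigned character.
fffe_category = unicodedata.category("\ufffe")
fffd_name = unicodedata.name("\ufffd")

# Decomposed "o" + COMBINING DIAERESIS is two code points; NFC composes it.
decomposed = "o\u0308"
composed = unicodedata.normalize("NFC", decomposed)

print([hex(u) for u in units], len(ch))
print(fffe_category, fffd_name)
print(len(decomposed), len(composed), composed == "\u00f6")
```

The point of the last three lines is only that canonical equivalence is
something you ask for explicitly via normalize(); comparing the two
strings directly with == still reports them as different.
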
Regards,
Martin