Martin v. Löwis <martin <at> v.loewis.de> writes:
>
> > Wrong term - code units and code points are equivalent in UTF-16 and
> > UTF-32. What you're looking for is unicode scalar values.
>
> How so? Section 2.5, UTF-16 says
>
> "code points in the supplementary planes, in the range
> U+10000..U+10FFFF, are represented as pairs of 16-bit code units."
>
> So clearly, code points in Unicode range from U+0000..U+10FFFF,
> independent of encoding form.
>
> In UTF-16, code units range from 0..65535.
>
> OTOH, "unicode scalar value" is nearly synonymous to "code point":
>
> D76 Unicode Scalar Value. Any Unicode code point except high-surrogate
> and low-surrogate code points.
>
> So codepoint in Terry's message was the right term.
>
No, Terry definitely did mean Unicode scalar values. He was describing the
"pure" but impractical len() that would count a surrogate pair as 1, not 2,
even in the 32-bit builds.

For what it is worth:

Code point: a number between 0 and 1114111 (0x10FFFF).

Scalar value: a code point, except the surrogate code points
(U+D800..U+DFFF).

Code unit: the basic unit of the encoding. One code unit is always
sufficient to encode some Unicode scalar values. However, other Unicode
scalar values may require multiple code units.

Note that a scalar value is always a code point, but a code point may or
may not be a scalar value.

Practical len() returns the number of code units of the internal storage
format. Pure len() allegedly would return the number of Unicode scalar
values (obviously a surrogate pair would be considered a single Unicode
scalar value).

Please keep in mind that encodings encode Unicode scalar values. Thus a
UTF-8 code unit sequence (or UTF-32 code unit) that would decode to a code
point in the surrogate range is technically in error. (Although Python
would do well to ignore this restriction, as there may be valid reasons to
have a UTF-8 sequence that is not a valid encoded Unicode text sequence.)
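To make the three terms concrete, here is a rough sketch, not anything from the interpreter itself. It assumes Python 3.3 or later, where len() on a str counts code points regardless of build; the helper names (is_surrogate, is_scalar_value, utf16_code_units, utf8_code_units) are made up for illustration.

```python
# Illustrative sketch only; assumes Python 3.3+ (len(str) counts code points).
# All helper names here are hypothetical, not part of any library.

def is_surrogate(cp):
    """True if the code point lies in the surrogate range U+D800..U+DFFF."""
    return 0xD800 <= cp <= 0xDFFF

def is_scalar_value(cp):
    """A scalar value is any code point except a surrogate code point."""
    return 0 <= cp <= 0x10FFFF and not is_surrogate(cp)

def utf16_code_units(text):
    """Number of 16-bit code units needed to encode text as UTF-16."""
    return len(text.encode("utf-16-le")) // 2

def utf8_code_units(text):
    """Number of 8-bit code units needed to encode text as UTF-8."""
    return len(text.encode("utf-8"))

# 'a' followed by U+10400, a character outside the BMP.
s = "a\U00010400"

code_points = len(s)                                   # two code points
scalar_values = sum(is_scalar_value(ord(ch)) for ch in s)
units16 = utf16_code_units(s)                          # U+10400 needs a surrogate pair
units8 = utf8_code_units(s)                            # U+10400 needs four bytes

# A lone surrogate is a code point but not a scalar value, so the strict
# UTF-8 codec refuses to encode it ...
lone = "\ud800"
try:
    lone.encode("utf-8")
    strict_rejected = False
except UnicodeEncodeError:
    strict_rejected = True

# ... unless the caller explicitly opts out of the restriction.
lenient_bytes = lone.encode("utf-8", "surrogatepass")
```

The last two steps correspond to the closing remark above: the strict codec treats a surrogate as an error, while the "surrogatepass" error handler is one way to let such a sequence through when there is a reason to.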