On Thu, Jul 3, 2008 at 11:35 AM, Jeroen Ruigrok van der Werven
<[EMAIL PROTECTED]> wrote:
> -On [20080703 19:21], Adam Olsen ([EMAIL PROTECTED]) wrote:
>>On Thu, Jul 3, 2008 at 7:57 AM, M.-A. Lemburg <[EMAIL PROTECTED]> wrote:
>>> Please remember that lone surrogate pair code points are perfectly
>>> valid Unicode code points, nevertheless. Just as a lone combining
>>> code point is valid on its own.
>>
>>That is a big part of these problems. For all practical purposes, a
>>surrogate is like a UTF-8 code unit, and must be handled the same way,
>>so why the heck do they confuse everybody by saying "oh, it's a code
>>point too!"?
>
> Because surrogate code points are not Unicode scalar values, isolated UTF-16
> code units in the range 0xd800-0xdfff are ill-formed. (D91 from Unicode
> 5.0/5.1, section 3.9)
>
> So, no, it is not a code point too.

UTF-16

D91 UTF-16 encoding form: The Unicode encoding form that assigns each
Unicode scalar value in the ranges U+0000..U+D7FF and U+E000..U+FFFF to a
single unsigned 16-bit code unit with the same numeric value as the Unicode
scalar value, and that assigns each Unicode scalar value in the range
U+10000..U+10FFFF to a surrogate pair, according to Table 3-5.

 • In UTF-16, the code point sequence <004D, 0430, 4E8C, 10302> is
   represented as <004D 0430 4E8C D800 DF02>, where <D800 DF02>
   corresponds to U+10302.

 • Because surrogate code points are not Unicode scalar values, isolated
   UTF-16 code units in the range D800₁₆..DFFF₁₆ are ill-formed.

In the context of UTF-8 or UTF-32, a Unicode scalar value is a single code
point of a valid character (more or less) and a code unit is the base unit
(1 and 4 bytes respectively) of which 1 or more combine to form a code
point. In UTF-16, code point becomes synonymous with code unit and Unicode
scalar value becomes one or more code points. WTF?

-- 
Adam Olsen, aka Rhamphoryncus
_______________________________________________
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
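
For anyone who wants to poke at the D91 text quoted above, here is a minimal sketch of the mapping it describes, written for Python 3. The function name utf16_code_units is made up for illustration and is not a standard-library API. The surrogate-pair arithmetic is the usual one (subtract 0x10000, split into two 10-bit halves), which is what I take Table 3-5 to mean. The last few lines compare against the built-in "utf-16-be" codec, which in current Python 3 also refuses a lone surrogate by default.

```python
def utf16_code_units(scalar_values):
    """Map Unicode scalar values to 16-bit code units, following D91.

    Hypothetical helper for illustration only.
    """
    units = []
    for cp in scalar_values:
        if not 0 <= cp <= 0x10FFFF:
            raise ValueError(f"{cp:#x} is outside the Unicode code space")
        if 0xD800 <= cp <= 0xDFFF:
            # Surrogate code points are not scalar values, so D91 gives
            # them no encoding; an isolated one is ill-formed.
            raise ValueError(f"U+{cp:04X} is a surrogate, not a scalar value")
        if cp <= 0xFFFF:
            # U+0000..U+D7FF and U+E000..U+FFFF: one code unit, same value.
            units.append(cp)
        else:
            # U+10000..U+10FFFF: a surrogate pair (assumed Table 3-5 math).
            offset = cp - 0x10000
            units.append(0xD800 + (offset >> 10))    # high (lead) surrogate
            units.append(0xDC00 + (offset & 0x3FF))  # low (trail) surrogate
    return units


units = utf16_code_units([0x004D, 0x0430, 0x4E8C, 0x10302])
print(" ".join(f"{u:04X}" for u in units))  # 004D 0430 4E8C D800 DF02

# The built-in codec agrees on the pair for U+10302 ...
print("\U00010302".encode("utf-16-be").hex())

# ... and rejects an isolated surrogate unless told otherwise.
try:
    "\ud800".encode("utf-16-be")
except UnicodeEncodeError as exc:
    print("lone surrogate rejected:", exc.reason)
```

Note that four scalar values went in and five code units came out, which is the code unit versus scalar value distinction the definition turns on.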