On Thu, Jul 3, 2008 at 7:57 AM, M.-A. Lemburg <[EMAIL PROTECTED]> wrote:
> On 2008-07-03 15:21, Jeroen Ruigrok van der Werven wrote:
>>
>> -On [20080703 15:00], M.-A. Lemburg ([EMAIL PROTECTED]) wrote:
>>>
>>> Unicode if full of combining code points - if you break such a sequence,
>>> the output will be just as wrong; regardless of UCS2 vs. UCS4.
>>
>> In my opinion you are confusing two related, but very separated things
>> here.
>> Combining characters have nothing to do with breaking up the encoding of a
>> single codepoint. Sure enough, if you arbitrary slice up codepoints that
>> consist of combining characters then your result is indeed odd looking.
>>
>> I never said that nor is that the point I am making.
>
> Please remember that lone surrogate pair code points are perfectly
> valid Unicode code points, nevertheless. Just as a lone combining
> code point is valid on its own.

That is a big part of these problems. For all practical purposes, a
surrogate is like a UTF-8 code unit, and must be handled the same way,
so why the heck do they confuse everybody by saying "oh, it's a code
point too!"?

-- 
Adam Olsen, aka Rhamphoryncus
_______________________________________________
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
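
To make the comparison between surrogates and UTF-8 code units concrete, here is a minimal sketch. It assumes Python 3 and picks an arbitrary character outside the BMP (U+1D11E); the helper names `is_surrogate` and `decodes` are made up for this illustration and are not part of any API discussed in the thread. It shows that one code point becomes four UTF-8 code units or two UTF-16 code units (both surrogates), and that cutting either sequence in the middle leaves something a strict decoder rejects.

```python
# Illustrative sketch only; the sample character is an arbitrary choice.
ch = "\U0001D11E"  # MUSICAL SYMBOL G CLEF, a code point outside the BMP

utf8_bytes = ch.encode("utf-8")
utf16_bytes = ch.encode("utf-16-be")

# UTF-8 code units are single bytes.
utf8_units = list(utf8_bytes)

# UTF-16 code units are 16-bit values; read them two bytes at a time.
utf16_units = [
    int.from_bytes(utf16_bytes[i:i + 2], "big")
    for i in range(0, len(utf16_bytes), 2)
]


def is_surrogate(unit):
    """Hypothetical helper: True if a 16-bit unit lies in the surrogate range."""
    return 0xD800 <= unit <= 0xDFFF


def decodes(data, encoding):
    """Hypothetical helper: True if the bytes decode under strict error handling."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


print([hex(u) for u in utf8_units])
print([hex(u) for u in utf16_units])

# Slicing in the middle of either encoded sequence breaks the code point.
utf8_half_ok = decodes(utf8_bytes[:2], "utf-8")
utf16_half_ok = decodes(utf16_bytes[:2], "utf-16-be")
print(utf8_half_ok, utf16_half_ok)
```

In both encodings the whole sequence round-trips and the truncated one does not, which is the sense in which a lone surrogate behaves like a stray UTF-8 code unit.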