Re: Grapheme clusters, a.k.a. real characters

Rhodri James Fri, 14 Jul 2017 08:01:17 -0700

On 14/07/17 15:14, Marko Rauhamaa wrote:

Rhodri James <[email protected]>:

On 14/07/17 14:31, Marko Rauhamaa wrote:

Of course, UTF-8 in a bytes object doesn't make the situation any
better, but does it make it any worse?


Speaking as someone who has been up to his elbows in this recently, I
would say emphatically that it does make things worse. It adds an
extra layer of complexity to all of the questions you were asking, and
more. A single codepoint is a meaningful thing, even if its meaning
may be modified by combining. A single byte may or may not be
meaningful.


I'd like to understand this better. Maybe you have a couple of examples
to share?


Sure.

What I've mostly been looking at recently has been the Expat XML parser.
XML chooses to deal with one of your problems by defining that it's
not having anything to do with combining: sequences of codepoints are
all you need to worry about when comparing strings. U+00E8 (LATIN SMALL
LETTER E WITH GRAVE) is not the same as U+0065 (LATIN SMALL LETTER E)
followed by U+0300 (COMBINING GRAVE ACCENT), for example.
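To make the comparison rule concrete, here is a small Python sketch (my own illustration, not anything taken from Expat or the XML spec). The variable names are made up for the example; the only library call is the standard unicodedata.normalize().

```python
import unicodedata

# One codepoint: U+00E8 LATIN SMALL LETTER E WITH GRAVE
precomposed = "\u00e8"
# Two codepoints: U+0065 LATIN SMALL LETTER E + U+0300 COMBINING GRAVE ACCENT
decomposed = "\u0065\u0300"

# Compared codepoint by codepoint, the two strings differ.
same_codepoints = precomposed == decomposed

# They only compare equal if the application chooses to normalise first
# (NFC composes the base letter and the combining accent).
same_after_nfc = unicodedata.normalize("NFC", decomposed) == precomposed

# The lengths in codepoints differ as well.
lengths = (len(precomposed), len(decomposed))
```

Both strings typically render as the same glyph, but a comparison that looks only at codepoint sequences treats them as different names.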

However, Expat is written in C, and it reads in UTF-8 as a sequence of
bytes. There are endless checks all over the code that complete UTF-8
byte sequences have been read in or passed across functional interfaces.
When you are dealing with a bytestream like this, you cannot assume
that you have complete codepoints, and you cannot find codepoint
boundaries without searching along the string. It's only once you have
reconstructed the codepoint that you can tell what sort of character you
have, and whether or not it is valid in your parsing context.
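As a rough illustration of the kind of bookkeeping involved, here is a Python sketch. It is not Expat's code; the helper names (is_continuation, expected_length, codepoint_starts, split_incomplete_tail) are hypothetical, and it only looks at lead and continuation bit patterns rather than fully validating UTF-8 (no checks for overlong forms or surrogates).

```python
def is_continuation(byte):
    # UTF-8 continuation bytes have the bit pattern 10xxxxxx.
    return (byte & 0xC0) == 0x80


def expected_length(lead):
    # Sequence length implied by a lead byte; 0 if it cannot start a sequence.
    if lead < 0x80:
        return 1
    if lead >> 5 == 0b110:
        return 2
    if lead >> 4 == 0b1110:
        return 3
    if lead >> 3 == 0b11110:
        return 4
    return 0


def codepoint_starts(buf):
    # Finding boundaries means scanning: every non-continuation byte
    # is taken as the start of a codepoint.
    return [i for i, b in enumerate(buf) if not is_continuation(b)]


def split_incomplete_tail(buf):
    # Return (complete, tail), where tail is a trailing sequence whose
    # lead byte promises more bytes than have arrived so far.
    for back in range(1, min(4, len(buf)) + 1):
        i = len(buf) - back
        if not is_continuation(buf[i]):
            if expected_length(buf[i]) > back:
                return buf[:i], buf[i:]
            break
    return buf, b""


data = "caf\u00e8!".encode("utf-8")   # U+00E8 becomes two bytes: C3 A8
starts = codepoint_starts(data)

# Simulate a read that stops in the middle of the two-byte sequence.
chunk = data[:4]
complete, tail = split_incomplete_tail(chunk)
```

With a str you would simply index the fourth character; with bytes, the chunk cannot even be decoded until the remaining byte turns up, so the caller has to hold the tail back and prepend it to the next read.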


--
Rhodri James *-* Kynesim Ltd
