On 14/07/17 15:14, Marko Rauhamaa wrote:
> Rhodri James <rho...@kynesim.co.uk>:
>
>> On 14/07/17 14:31, Marko Rauhamaa wrote:
>>> Of course, UTF-8 in a bytes object doesn't make the situation any
>>> better, but does it make it any worse?
>>
>> Speaking as someone who has been up to his elbows in this recently, I
>> would say emphatically that it does make things worse. It adds an
>> extra layer of complexity to all of the questions you were asking, and
>> more. A single codepoint is a meaningful thing, even if its meaning
>> may be modified by combining. A single byte may or may not be
>> meaningful.
>
> I'd like to understand this better. Maybe you have a couple of examples
> to share?

Sure.

What I've mostly been looking at recently has been the Expat XML parser. XML chooses to deal with one of your problems by declaring that it will have nothing to do with combining: sequences of codepoints are all you need to worry about when comparing strings. U+00E8 (LATIN SMALL LETTER E WITH GRAVE) is not the same as U+0065 (LATIN SMALL LETTER E) followed by U+0300 (COMBINING GRAVE ACCENT), for example.
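
To illustrate that in Python terms (a minimal sketch; the variable names are made up for the example, and the NFC step is only there for contrast, since it is something the application would have to do itself):

```python
import unicodedata

precomposed = "\u00e8"   # LATIN SMALL LETTER E WITH GRAVE
decomposed = "e\u0300"   # LATIN SMALL LETTER E + COMBINING GRAVE ACCENT

# Compared codepoint by codepoint, the two strings differ.
codepoint_equal = precomposed == decomposed

# They only compare equal after explicit normalisation.
nfc_equal = unicodedata.normalize("NFC", decomposed) == precomposed

# One codepoint versus two.
lengths = (len(precomposed), len(decomposed))

print(codepoint_equal, nfc_equal, lengths)
```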

However, Expat is written in C, and it reads in UTF-8 as a sequence of bytes. There are endless checks all over the code that complete UTF-8 byte sequences have been read in or passed across functional interfaces. When you are dealing with a bytestream like this, you cannot assume that you have complete codepoints, and you cannot find codepoint boundaries without searching along the string. It's only once you have reconstructed the codepoint that you can tell what sort of character you have, and whether or not it is valid in your parsing context.
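
A rough Python sketch of the same problem follows. It is not Expat's code, just an illustration; the sample data and the helper is_continuation() are invented for the example. A buffer that happens to end in the middle of a codepoint cannot be decoded on its own, and finding boundaries means inspecting the bytes one at a time:

```python
import codecs

data = "\u00e8<".encode("utf-8")      # b'\xc3\xa8<': two bytes for U+00E8, one for '<'
first, second = data[:1], data[1:]    # pretend the read stopped after one byte

def is_continuation(byte):
    """True for UTF-8 continuation bytes, which have the bit pattern 10xxxxxx."""
    return byte & 0xC0 == 0x80

# The first chunk is not a complete codepoint, so decoding it alone fails.
try:
    first.decode("utf-8")
    complete = True
except UnicodeDecodeError:
    complete = False

# An incremental decoder holds the partial sequence until the rest arrives.
decoder = codecs.getincrementaldecoder("utf-8")()
out1 = decoder.decode(first)
out2 = decoder.decode(second, final=True)

# Codepoint boundaries can only be found by scanning along the bytes.
boundaries = [i for i, b in enumerate(data) if not is_continuation(b)]

print(complete, repr(out1), repr(out2), boundaries)
```

With a str none of this bookkeeping is needed: every index is already a whole codepoint.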

--
Rhodri James *-* Kynesim Ltd
--
https://mail.python.org/mailman/listinfo/python-list