Re: [hackers] [sbase][PATCH 4/5] fold: fix handling of multibyte characters

Laslo Hunhold Wed, 30 Sep 2020 23:56:01 -0700

On Wed, 30 Sep 2020 22:41:47 -0700
Michael Forney <[email protected]> wrote:


Dear Michael,

> POSIX says we should be counting column positions rather than
> codepoints, but I think that might be rather difficult to get right
> and this is probably an improvement already.
> 
> I know Laslo has studied this area for libgrapheme, so maybe he has
> suggestions.

if you want to do it 100% right, there's no way around using
libgrapheme (or another library that handles grapheme clusters, like
ICU, though I bet none is nearly as lightweight as libgrapheme).
Counting codepoints only gets you halfway there: there are trivial
counterexamples where the codepoint count and the number of
user-perceived characters disagree.

On the other hand, in the Western world, most multi-codepoint grapheme
clusters are emojis, plus certain cases involving more complex writing
systems. It's a much different matter when you go to Asia or Africa,
where you can't really implement many very popular writing systems
(like Hangul) properly without using grapheme clusters.
Most important in general, though, is the case of processing
denormalized input (i.e. where everything is broken down as much as
possible; for example, the single codepoint
(=1-codepoint-grapheme-cluster) "ä" is turned into the codepoint "a"
with an umlaut modifier, making it a 2-codepoint-grapheme-cluster),
which leads to a lot of gotchas, inconsistencies and maybe even
security problems.
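
To make the "ä" and Hangul cases concrete, here is a minimal sketch in
plain C. It is not code from sbase or libgrapheme; count_codepoints()
and the four test strings are hypothetical names made up for this
illustration, and the counter assumes well-formed UTF-8. It only shows
that the same visible character yields different codepoint counts
depending on normalization.

```c
#include <stddef.h>

/*
 * Hypothetical helper: count codepoints in a NUL-terminated UTF-8
 * string by counting every byte that is not a continuation byte
 * (10xxxxxx). Assumes the input is well-formed UTF-8.
 */
size_t
count_codepoints(const char *s)
{
	size_t n = 0;

	for (; *s; s++)
		if (((unsigned char)*s & 0xC0) != 0x80)
			n++;
	return n;
}

/* "ä" as the single codepoint U+00E4 */
const char a_composed[] = "\xC3\xA4";
/* "ä" as U+0061 "a" followed by U+0308 (combining diaeresis) */
const char a_decomposed[] = "a\xCC\x88";
/* Hangul syllable "한" as the single codepoint U+D55C */
const char han_composed[] = "\xED\x95\x9C";
/* the same syllable as conjoining jamo U+1112 U+1161 U+11AB */
const char han_decomposed[] = "\xE1\x84\x92\xE1\x85\xA1\xE1\x86\xAB";
```

count_codepoints() returns 1 for a_composed and han_composed, but 2
for a_decomposed and 3 for han_decomposed, even though each pair is
one grapheme cluster. A fold that counts codepoints would therefore
break the decomposed forms earlier than the composed ones.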

All in all, though, codepoint-counting is a step in the right
direction, but definitely not exhaustive, especially as time moves on
and more and more people are using the higher Unicode planes for data.
If you really want to do it right, you must handle grapheme clusters,
and libgrapheme is actually very fast and should even be faster than
the Rune-solution using libutf.h, because it works on the byte level
rather than the codepoint level.

With best regards

Laslo
