On Wed, 30 Sep 2020 22:41:47 -0700 Michael Forney <[email protected]> wrote:
Dear Michael,

> POSIX says we should be counting column positions rather than
> codepoints, but I think that might be rather difficult to get right
> and this is probably an improvement already.
>
> I know Laslo has studied this area for libgrapheme, so maybe he has
> suggestions.

if you want to do it 100% right, there's no way around using
libgrapheme (or another library handling grapheme clusters like icu,
but I bet there's none nearly as lightweight as libgrapheme). Counting
codepoints only gets you halfway there, and there are trivial
counterexamples showing that it is not the complete solution and that
discrepancies remain.

On the other hand, in the western world, most multi-codepoint grapheme
clusters are emojis and certain cases from more complex writing
systems. It's a much different matter when you go to Asia or Africa,
where you can't really properly implement many very popular writing
systems (like Hangul) without using grapheme clusters.

Most important in general, though, is the case where you're processing
denormalized input (i.e. where everything is broken down as much as
possible, for example the single codepoint
(=1-codepoint-grapheme-cluster) "ä" is turned into the codepoint "a"
followed by an umlaut modifier, making it a
2-codepoint-grapheme-cluster), which leads to a lot of gotchas,
inconsistencies and maybe even security problems.

All in all though, codepoint-counting is a step in the right direction,
but definitely not exhaustive, especially as time moves on and more and
more people are using the higher unicode planes for data. If you really
want to do it right, you must handle grapheme clusters, and libgrapheme
is actually very fast and should even be faster than the Rune-solution
using libutf.h, because it works on the byte-level rather than the
Codepoint-level.

With best regards

Laslo
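
PS: To make the "ä" example concrete, here is a minimal sketch of plain
codepoint counting. It is not taken from sbase, libutf or libgrapheme;
the function name count_codepoints is made up for illustration, and it
assumes the input is valid, NUL-terminated UTF-8.

```c
#include <stddef.h>

/*
 * Hypothetical helper: count the codepoints in a NUL-terminated UTF-8
 * string by counting every byte that is not a continuation byte
 * (bit pattern 10xxxxxx). Assumes the input is valid UTF-8.
 */
size_t
count_codepoints(const char *s)
{
	size_t n = 0;

	for (; *s != '\0'; s++) {
		if (((unsigned char)*s & 0xC0) != 0x80)
			n++;
	}

	return n;
}
```

Called on the precomposed form "\xC3\xA4" (U+00E4) this yields 1, while
the decomposed form "a\xCC\x88" (U+0061 followed by U+0308) yields 2,
even though both display as the same single character. A grapheme
cluster count would be 1 in both cases.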
