On 2014-03-08 23:50:43 +0000, Andrei Alexandrescu <[email protected]> said:

Graphemes do not appear to have a 1:1 mapping with dchars, and any
attempt to do so would likely be a giant mistake.

I think they may be comparable to dchar.

Dchars, aka code points, are much more clearly defined than graphemes. A quick search shows me there's more than one way to segment a string into graphemes. There are the legacy and extended boundary algorithms for general processing, and then there are some tailored algorithms that can segment code points differently depending on the locale or other considerations.

Reference:
http://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries
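
To make the mismatch between code points and graphemes concrete, here is a rough sketch in Python (picked only because its standard library ships Unicode tables). The helper `approx_clusters` is a made-up name and a deliberately crude approximation: it only attaches combining marks to the preceding base character, and it is not the legacy or extended algorithm from the report above.

```python
import unicodedata

def approx_clusters(text):
    """Group each base code point with the combining marks that follow it.

    Crude approximation for illustration only; this is NOT the UAX #29
    grapheme cluster boundary algorithm.
    """
    clusters = []
    for ch in text:
        # unicodedata.combining() returns a nonzero class for combining marks.
        if clusters and unicodedata.combining(ch) != 0:
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

decomposed = "e\u0301"    # 'e' followed by COMBINING ACUTE ACCENT
precomposed = "\u00e9"    # the single code point LATIN SMALL LETTER E WITH ACUTE

code_points_decomposed = len(decomposed)              # two code points
code_points_precomposed = len(precomposed)            # one code point
clusters_decomposed = approx_clusters(decomposed)     # one user-perceived character
```

Both strings display as the same character, yet they differ in code point count, which is why a 1:1 mapping between the two notions cannot hold.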

There are three examples of locale-specific graphemes in the table in the section linked above. "Ch" is one of them. Quoting Wikipedia: "Ch is a digraph in the Latin script. It is treated as a letter of its own in Chamorro, Czech, Slovak, Igbo, Quechua, Guarani, Welsh, Cornish, Breton and Belarusian Łacinka alphabets."
https://en.wikipedia.org/wiki/Ch_(digraph)
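
As a sketch of what locale tailoring could look like, the snippet below counts "letters" with an optional locale argument. Everything here is hypothetical: the `TAILORED_DIGRAPHS` table, the locale codes used as keys, and the `segment_letters` function are invented for illustration and do not reflect any real library's API or a complete tailoring.

```python
# Hypothetical tailoring table: digraphs treated as a single letter per locale.
# Only Czech and Slovak are filled in, as examples taken from the quote above.
TAILORED_DIGRAPHS = {
    "cs": ("ch",),
    "sk": ("ch",),
}

def segment_letters(word, locale=None):
    """Split a word into letters, merging digraphs the locale treats as one."""
    digraphs = TAILORED_DIGRAPHS.get(locale, ())
    units = []
    i = 0
    while i < len(word):
        for dg in digraphs:
            if word[i:i + len(dg)].lower() == dg:
                units.append(word[i:i + len(dg)])
                i += len(dg)
                break
        else:
            # No digraph matched at this position: take a single code point.
            units.append(word[i])
            i += 1
    return units

untailored = segment_letters("chata")        # five units
czech = segment_letters("chata", "cs")       # four units, "ch" kept together
```

The same string yields a different count depending on the locale passed in, so the caller has to say which answer is wanted.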

Also, there are some code points that represent ligatures (such as “fl”), which are in theory two graphemes. I'm not sure what the general algorithm does with that, but depending on what you're doing (counting characters? spell checking?) you might want to split it in two.
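
One way to split such a ligature, sketched in Python, is compatibility normalization (NFKC), which replaces U+FB02 LATIN SMALL LIGATURE FL with the two letters "f" and "l". This is just one possible approach, not something the segmentation algorithms do for you, and whether it is appropriate depends on the task.

```python
import unicodedata

ligature_word = "\ufb02ower"    # starts with U+FB02 LATIN SMALL LIGATURE FL

# NFKC applies compatibility decompositions, which expand the ligature.
expanded_word = unicodedata.normalize("NFKC", ligature_word)

count_before = len(ligature_word)    # the ligature counts as one code point
count_after = len(expanded_word)     # "f" and "l" are counted separately
```

A spell checker would probably want the expanded form, while code that must round-trip the text exactly would not.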

So basically you just can't make an algorithm capable of counting letters/graphemes/characters in a universal fashion. There's no such thing as a universal grapheme segmentation algorithm, even though there is a general one. It'd be wise for any API to expose this subtlety whenever segmenting graphemes.

Text is an interesting topic for never-ending discussions.

--
Michel Fortin
[email protected]
http://michelf.ca
