Re: Major performance problem with std.array.front()

Michel Fortin Sat, 08 Mar 2014 18:21:12 -0800

On 2014-03-08 23:50:43 +0000, Andrei Alexandrescu <[email protected]> said:

Graphemes do not appear to have a 1:1 mapping with dchars, and any
attempt to do so would likely be a giant mistake.


I think they may be comparable to dchar.

Dchars, aka code points, are much more clearly defined than graphemes. A quick search shows me there's more than one way to segment a string into graphemes. There are the legacy and extended boundary algorithms for general processing, and then there are some tailored algorithms that can segment code points differently depending on the locale or other considerations.


Reference:
http://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries

There are three examples of locale-specific graphemes in the table in the section linked above. "Ch" is one of them. Quoting Wikipedia: "Ch is a digraph in the Latin script. It is treated as a letter of its own in Chamorro, Czech, Slovak, Igbo, Quechua, Guarani, Welsh, Cornish, Breton and Belarusian Łacinka alphabets."

https://en.wikipedia.org/wiki/Ch_(digraph)

Also, there are some code points that represent ligatures (such as “ﬂ”), which are in theory two graphemes. I'm not sure what the general algorithm does with that, but depending on what you're doing (counting characters? spell checking?) you might want to split it in two.
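To make the ligature case concrete, here is a minimal sketch in Python (chosen only because its standard library suffices; it is not D's std.uni). The helper name split_ligatures is hypothetical. It relies on NFKC compatibility normalization, which is one possible way to split a ligature and not something the general segmentation algorithm does. NFKC also rewrites other compatibility characters (superscripts, for instance), so it may be too aggressive for some uses.

```python
import unicodedata

# U+FB02 LATIN SMALL LIGATURE FL: a single code point that reads as two letters.
ligature = "\ufb02"


def split_ligatures(text: str) -> str:
    """Hypothetical helper: expand compatibility characters, ligatures included.

    Assumption: NFKC normalization is an acceptable way to split ligatures
    for the task at hand (e.g. spell checking). It changes other
    compatibility characters too.
    """
    return unicodedata.normalize("NFKC", text)


if __name__ == "__main__":
    print(len(ligature))               # number of code points in the ligature
    print(unicodedata.name(ligature))  # official Unicode character name
    expanded = split_ligatures(ligature)
    print(expanded, len(expanded))     # the expanded text and its length
```

Whether the ligature counts as one "character" or two therefore depends on whether the caller normalizes first, which is a decision about the task, not about the text.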

So basically you just can't make an algorithm capable of counting letters/graphemes/characters in a universal fashion. There's no such thing as a universal grapheme segmentation algorithm, even though there is a general one. It'd be wise for any API to expose this subtlety whenever segmenting graphemes.
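One way an API could expose that subtlety is to make the segmentation policy an explicit argument instead of a hidden default. The following Python sketch is an illustration only: every name in it (by_code_point, by_base_plus_marks, with_digraphs, count_units) is hypothetical, and by_base_plus_marks is a deliberately naive stand-in that glues combining marks to the preceding code point. It is not an implementation of the UAX #29 boundary rules.

```python
import unicodedata
from typing import Callable, Iterable, Iterator

# A segmenter turns a string into a sequence of user-visible units.
Segmenter = Callable[[str], Iterable[str]]


def by_code_point(text: str) -> Iterator[str]:
    """One unit per code point (roughly what iterating dchars gives)."""
    yield from text


def by_base_plus_marks(text: str) -> Iterator[str]:
    """Naive clustering: attach combining marks to the preceding code point.

    Assumption: this is a simplification for illustration, NOT UAX #29.
    """
    cluster = ""
    for ch in text:
        if cluster and unicodedata.combining(ch) == 0:
            yield cluster
            cluster = ch
        else:
            cluster += ch
    if cluster:
        yield cluster


def with_digraphs(inner: Segmenter, digraphs: set) -> Segmenter:
    """Tailoring: merge adjacent units that form a locale-specific digraph."""
    def segment(text: str) -> Iterator[str]:
        pending = None
        for unit in inner(text):
            if pending is not None:
                if (pending + unit).lower() in digraphs:
                    yield pending + unit
                    pending = None
                    continue
                yield pending
            pending = unit
        if pending is not None:
            yield pending
    return segment


def count_units(text: str, segmenter: Segmenter) -> int:
    """Count units under an explicitly chosen segmentation policy."""
    return sum(1 for _ in segmenter(text))


if __name__ == "__main__":
    czech = with_digraphs(by_base_plus_marks, {"ch"})
    print(count_units("chata", by_code_point))        # code points
    print(count_units("chata", czech))                # "ch" treated as one letter
    print(count_units("e\u0301", by_code_point))      # "e" + combining acute
    print(count_units("e\u0301", by_base_plus_marks))
```

Because the caller has to name a segmenter, the same string can legitimately yield different counts, and the choice is visible at the call site.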


Text is an interesting topic for never-ending discussions.

--
Michel Fortin
[email protected]
http://michelf.ca
