On Thu, May 12, 2016 at 08:24:23PM +0000, Vladimir Panteleev via Digitalmars-d wrote:
[...]
> 12. The result of autodecoding, a range of Unicode code points, is
> rarely actually useful, and code that relies on autodecoding is rarely
> actually, universally correct. Graphemes are occasionally useful for a
> subset of scripts, and a subset of that subset has all graphemes
> mapped to single code points, but this only applies to some
> scripts/languages.
>
> In the majority of cases, autodecoding provides only the illusion of
> correctness.

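To make the quoted point concrete, here is a minimal sketch that counts the same string three ways. It assumes a Phobos that ships std.utf.byCodeUnit and std.uni.byGrapheme; the count* helper names are made up for this example and are not part of any library.

```d
import std.range : walkLength;
import std.uni : byGrapheme;
import std.utf : byCodeUnit;

// Hypothetical helpers, named for illustration only.

// Counts UTF-8 code units; no decoding happens here.
size_t countCodeUnits(string s) { return s.byCodeUnit.walkLength; }

// Counts code points; a plain string is autodecoded into a range of dchar.
size_t countCodePoints(string s) { return s.walkLength; }

// Counts grapheme clusters, i.e. what a reader would call "characters".
size_t countGraphemes(string s) { return s.byGrapheme.walkLength; }

void main()
{
    import std.stdio : writefln;

    // "café" with the accent written as 'e' + U+0301 COMBINING ACUTE ACCENT
    string s = "cafe\u0301";

    writefln("code units:  %s", countCodeUnits(s));
    writefln("code points: %s", countCodePoints(s));
    writefln("graphemes:   %s", countGraphemes(s));
}
```

For this input the three counts are 6, 5 and 4 respectively. Only the grapheme count matches the four "characters" a reader sees, and it is the only one of the three that autodecoding does not give you by default.
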
A range of Unicode code points is not the same as a range of graphemes (a grapheme is what a layperson would consider to be a "character"). Autodecoding returns dchar, a code point, rather than a grapheme. Therefore, autodecoding only produces intuitively correct results when your string has a 1-to-1 correspondence between graphemes and code points. In general, this is only true for a small subset of languages, mainly a few common European languages and a handful of others. It breaks for Korean when Hangul is written as decomposed jamo, and for any language that uses combining diacritics or other modifiers. You need byGrapheme to get correct results.

So basically autodecoding, as currently implemented, fails to meet its goal of segmenting a string by "character" (i.e., grapheme), and yet imposes a performance penalty that is difficult to "turn off": you have to sprinkle byCodeUnit throughout your code, and many Phobos algorithms just return a range of dchar anyway. Not to mention that a good number of string algorithms don't actually *need* autodecoding at all.

(One could make a case for auto-segmenting by grapheme, but that's even worse in terms of performance: it requires a non-trivial Unicode algorithm involving lookup tables, and may need memory allocation. So at the end of the day, we're back to square one: iterate by code unit, and explicitly ask for byGrapheme where necessary.)


T

--
"I'm running Windows '98." "Yes." "My computer isn't working now." "Yes, you already said that." -- User-Friendly
