On Tuesday, May 31, 2016 20:38:14 Nick Sabalausky via Digitalmars-d wrote:
> On 05/31/2016 04:55 PM, Andrei Alexandrescu wrote:
> > On 05/31/2016 04:55 PM, Andrei Alexandrescu wrote:
> >> On 05/31/2016 03:32 PM, H. S. Teoh via Digitalmars-d wrote:
> >>> Let's put the question this way. Given the following string, what do
> >>> *you* think walkLength should return?
> >>>
> >>> şŭt̥ḛ́k̠
> >>
> >> The number of code units in the string. That's the contract promised and
> >> honored by Phobos. -- Andrei
> >
> > Code points I mean. -- Andrei
>
> Yes, we know it's the contract. ***That's the problem.*** As everybody
> is saying, it *SHOULDN'T* be the contract.
>
> Why shouldn't it be the contract? Because it's proven itself, both
> logically (as presented by pretty much everybody other than you in both
> this and other threads) and empirically (in phobos, warp, and other user
> code) to be both the least useful and most PITA option.

Exactly. Operating at the code point level rarely makes sense. What sorts
of algorithms purposefully do that in a typical program? Unless you're
doing very specific Unicode stuff or somehow know that your strings don't
contain any graphemes that are made up of multiple code points, operating
at the code point level is just bug-prone, and unless you're using dchar[]
everywhere, it's slow to boot, because your strings have to be decoded
whether the algorithm needs it or not.

I think that it's very safe to say that the vast majority of string
algorithms are either able to operate at the code unit level without
decoding (though possibly encoding another string to match, e.g. with a
string comparison or search), or they have to operate at the grapheme
level in order to deal with full characters.

A code point is borderline useless on its own. It's just a step above the
different UTF encodings without actually getting to proper characters.

- Jonathan M Davis
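
To make the three levels discussed above concrete, here is a minimal sketch
in D. It is only an illustration, not code from the thread: the helper names
(countCodeUnits, countCodePoints, countGraphemes) are hypothetical, and the
sample string is a deliberately simple one ('e' plus a combining accent)
rather than the string quoted above. It assumes a Phobos where walkLength
auto-decodes narrow strings, and uses std.utf.byCodeUnit and
std.uni.byGrapheme for the other two levels.

```d
import std.range : walkLength;
import std.stdio : writeln;
import std.uni : byGrapheme;
import std.utf : byCodeUnit;

// Hypothetical helper names, chosen for this illustration only.

/// Number of UTF-8 code units; no decoding is performed.
size_t countCodeUnits(string s)
{
    return s.byCodeUnit.walkLength;
}

/// Number of code points; walkLength auto-decodes a string into dchars.
size_t countCodePoints(string s)
{
    return s.walkLength;
}

/// Number of graphemes (user-perceived characters).
size_t countGraphemes(string s)
{
    return s.byGrapheme.walkLength;
}

void main()
{
    // 'e' followed by U+0301 COMBINING ACUTE ACCENT: one visible character.
    string s = "e\u0301";

    writeln(countCodeUnits(s));  // 3: 'e' is 1 byte, U+0301 is 2 bytes in UTF-8
    writeln(countCodePoints(s)); // 2
    writeln(countGraphemes(s));  // 1
}
```

For this input the code point count (2) matches neither the storage size (3)
nor what a reader would call one character (1), which is the mismatch the
post is describing.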
