On Thursday, March 06, 2014 18:37:13 Walter Bright wrote:
> Is there any hope of fixing this?
I agree with Andrei. I don't think that there's really anything to fix.

The problem is that there are roughly three levels at which string operations can be done:

1. by code unit
2. by code point
3. by grapheme

and which is correct depends on what you're trying to do. Phobos attempts to go for correctness by default without seriously impacting performance, so it treats all strings as ranges of dchar (level #2). If we went with #1, then pretty much any algorithm which operated on individual characters would be broken, because unless your strings are ASCII-only, code units are very much the wrong level to be operating on if you're trying to deal with characters. If we went with #3, then we'd have full correctness, but we'd tank performance. With #2, we're far more correct than is typically the case with C++ while still being reasonably performant.

Those who want full performance can use immutable(ubyte)[] to get #1, and those who want #3 can use the grapheme support in std.uni. We've gone to great lengths in Phobos to specialize on narrow strings in order to make them more efficient while still maintaining correctness, and anyone who really wants performance can do the same. But by operating at the code point level, we at least get a reasonable level of Unicode-correctness by default. With your suggestion, I'd fully expect most D programs to be wrong with regards to Unicode, because most programmers don't know or care about how Unicode works.

And changing what we're doing now would be code breakage of astronomical proportions. It would essentially break all uses of range-based string code. Certainly, it would be the largest code breakage that D has seen in years, if not ever. So, it's almost certainly a bad idea, but if it isn't, we need to be darn sure that what we change to is significantly better and worth the huge amount of code breakage that it will cause.

I really don't think that there's any way to get this right. Regardless of which level you operate at by default - be it code unit, code point, or grapheme - it will be wrong a good chunk of the time. So, it becomes a question of which of the three has the best tradeoffs, and I think that our current solution of operating on code points by default does that. If there are things that we can do to better support operating on code units or graphemes for those who want it, then great. And it's great if we can find ways to make operating at the code point level more efficient or less prone to bugs due to not operating at the grapheme level. But I think that operating at the code point level like we currently do is by far the best approach.

If anything, it's the fact that the language doesn't do that that's the bigger concern IMHO - the main place where that's an issue being the fact that foreach iterates by code unit by default. But I don't know of a good way to solve that other than treating all arrays of char, wchar, and dchar specially and disabling their array operations like ranges do, so that you have to convert them via the representation function in order to operate on them as code units - which Andrei has suggested a number of times before, but you've shot him down each time. If that were fixed, then at least we'd be consistent, which is usually the biggest complaint with regards to how D treats strings. But I really don't think that there's a magical fix for range-based string operations, and I think that our current approach is a good one.

- Jonathan M Davis
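[Editor's illustration, not part of the original post.] A minimal sketch of the three levels discussed above, assuming the Phobos helpers mentioned in the post (std.string.representation for the code-unit view, std.uni.byGrapheme for graphemes) plus std.range.walkLength for counting; exact module layout may differ by compiler release:

```d
import std.range  : walkLength;
import std.stdio  : writefln;
import std.string : representation; // level #1: view the string as immutable(ubyte)[]
import std.uni    : byGrapheme;     // level #3: grapheme support in std.uni

void main()
{
    // "e" followed by U+0301 COMBINING ACUTE ACCENT: one grapheme,
    // two code points, three UTF-8 code units.
    string s = "e\u0301";

    writefln("code units:  %s", s.representation.length); // 3
    writefln("code points: %s", s.walkLength);            // 2 (string as a range of dchar)
    writefln("graphemes:   %s", s.byGrapheme.walkLength); // 1

    // The language-level inconsistency mentioned above: foreach over a
    // string iterates by code unit unless you ask for dchar explicitly.
    size_t units, points;
    foreach (char c; s)  ++units;   // 3 iterations, one per UTF-8 code unit
    foreach (dchar c; s) ++points;  // 2 iterations, one per decoded code point
    writefln("foreach char: %s, foreach dchar: %s", units, points);
}
```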
