On Thursday, June 02, 2016 18:23:19 Andrei Alexandrescu via Digitalmars-d wrote:
> On 06/02/2016 05:58 PM, Walter Bright wrote:
> > On 6/2/2016 1:27 PM, Andrei Alexandrescu wrote:
> >> The lambda returns bool. -- Andrei
> >
> > Yes, I was wrong about that. But the point still stands with:
> > > * s.balancedParens('〈', '〉') works only with autodecoding.
> > > * s.canFind('ö') works only with autodecoding. It returns always
> > > false without.
> >
> > Can be made to work without autodecoding.
>
> By special casing? Perhaps. I seem to recall though that one major issue
> with autodecoding was that it special-cases certain algorithms. So you'd
> need to go through all of std.algorithm and make sure you can
> special-case your way out of situations that work today.
Yeah, I believe that you do have to do some special casing, though it would be special-casing on ranges of code units in general and not on strings specifically, and a lot of those functions are already special-cased on strings in an attempt to be efficient. In particular, with a function like find or canFind, you'd take the needle and encode it to match the haystack it was passed, so that you can do the comparisons via code units. That way, you incur the encoding cost once when encoding the needle rather than incurring the decoding cost for each code point or grapheme as you iterate over the haystack. So, you end up with something that's both correct and efficient. It's also much friendlier to code that only operates on ASCII.

The one issue that I'm not quite sure how we'd handle in that case is normalization (which auto-decoding doesn't handle either), since you'd need to normalize the needle to match the haystack (which also assumes that the haystack was already normalized). Certainly, it's the sort of thing that makes you kind of wish you were dealing with a string type that had the normalization built into it rather than either an array of code units or an arbitrary range of code units. But maybe we could assume NFC normalization like std.uni.normalize does and provide an optional template argument for the normalization scheme.

In any case, while it's not entirely straightforward, it is quite possible to write such algorithms in a way that works on arbitrary ranges of code units and deals with Unicode correctly, without auto-decoding and without requiring that the user convert to a range of code points or graphemes in order to properly handle the full range of Unicode. And even if we keep auto-decoding, we pretty much need to fix things so that std.algorithm and friends are Unicode-aware in this manner, so that ranges of code units work in general without requiring that you use byGrapheme.
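To make the needle-encoding idea concrete, here's a minimal sketch. std.utf.encode and byCodeUnit are real Phobos functions; canFindEncoded is an invented name for illustration, and it assumes the haystack is already normalized to match the needle, per the normalization caveat above:

```d
import std.algorithm.searching : canFind;
import std.utf : byCodeUnit, encode;

// Hypothetical helper: encode the dchar needle to UTF-8 once, then do a
// plain code-unit substring search - no per-element decoding of the haystack.
bool canFindEncoded(const(char)[] haystack, dchar needle)
{
    char[4] buf;
    immutable len = encode(buf, needle);  // one-time encoding cost
    // byCodeUnit sidesteps auto-decoding so the comparison stays on char.
    return haystack.byCodeUnit.canFind(buf[0 .. len].byCodeUnit);
}
```

The same trick generalizes to string needles: transcode the needle to the haystack's code-unit type up front, then search code unit by code unit.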
So, this sort of thing could have a large impact on RCStr, even if we keep auto-decoding for narrow strings.

Other algorithms, however, can't be made to work automatically with Unicode - at least not with the current range paradigm. filter, for instance, really needs to operate on graphemes to filter on characters, but with a range of code units, that would mean operating on groups of code units as a single element, which you can't do with something like a range of char, since that essentially becomes a range of ranges. It has to be wrapped in a range that provides graphemes - and of course, if you know that you're operating only on ASCII, then you wouldn't want to deal with graphemes anyway, so automatically converting to graphemes would be undesirable. So, for a function like filter, it really does have to be up to the programmer to indicate what level of Unicode they want to operate at.

But if we don't make functions Unicode-aware where possible, then we're going to take a performance hit by essentially forcing everyone to use explicit ranges of code points or graphemes even when those should be unnecessary. So, I think that we're stuck with some level of special casing, but it would then be for ranges of code units and code points rather than for strings specifically. That way, it would work efficiently for stuff like RCStr, which the current scheme does not.

I think that the reality of the matter is that regardless of whether we keep auto-decoding for narrow strings, we need to make Phobos operate on arbitrary ranges of code units and code points. Even stuff like RCStr won't work efficiently otherwise, and stuff like byCodeUnit won't be usable in as many cases otherwise, because if a generic function isn't Unicode-aware, then in many cases byCodeUnit will be very wrong, just like byCodePoint would be wrong. So, as far as Phobos goes, I'm not sure that the question of auto-decoding matters much for what we need to do at this point.
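A minimal sketch of the programmer picking the Unicode level explicitly for filter. byGrapheme (std.uni) and byCodeUnit (std.utf) are real Phobos ranges; the strings and predicates are just illustrative:

```d
import std.algorithm.iteration : filter;
import std.array : array;
import std.range : walkLength;
import std.uni : byGrapheme;
import std.utf : byCodeUnit;

void main()
{
    // Grapheme level: 'e' plus combining acute (U+0301) stays together as
    // one element, so filtering can't split a character apart.
    auto noX = "e\u0301x".byGrapheme.filter!(g => g[0] != 'x');
    assert(noX.walkLength == 1);  // only the e-with-acute grapheme remains

    // ASCII-only input: code units suffice, and no decoding happens at all.
    auto noA = "banana".byCodeUnit.filter!(c => c != 'a').array;
    assert(noA == "bnn");
}
```

Neither choice can be made automatically: the grapheme version is correct for arbitrary Unicode but pays for segmentation, while the code-unit version is only valid because the caller knows the input is ASCII.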
If we do what we need to do, then Phobos will work whether we have auto-decoding or not (working in a Unicode-aware manner where possible and forcing the user to decide the correct level of Unicode to work at where not), and then it just becomes a question of whether we can or should deprecate auto-decoding once all that's done. - Jonathan M Davis