On Saturday, September 8, 2018 9:36:25 AM MDT Steven Schveighoffer via Digitalmars-d wrote:
> On 8/9/18 2:44 AM, Walter Bright wrote:
> > On 8/8/2018 2:01 PM, Steven Schveighoffer wrote:
> >> Here's where I'm struggling -- because a string provides indexing,
> >> slicing, length, etc. but Phobos ignores that. I can't make a new type
> >> that does the same thing. Not only that, but I'm finding the
> >> specializations of algorithms only work on the type "string", and
> >> nothing else.
> >
> > One of the worst things about autodecoding is it is special, it *only*
> > steps in for strings. Fortunately, however, that specialness enabled us
> > to save things with byCodePoint and byCodeUnit.
>
> So it turns out that technically the problem here, even though it seemed
> like an autodecoding problem, is a problem with splitter.
>
> splitter doesn't deal with encodings of character ranges at all.
>
> For instance, when you have this:
>
> "abc 123".byCodeUnit.splitter;
>
> What happens is splitter only has one overload that takes one parameter,
> and that requires a character *array*, not a range.
>
> So the byCodeUnit result is aliased-this to its original, and surprise!
> the elements from that splitter are string.
>
> Next, I tried to use a parameter:
>
> "abc 123".byCodeUnit.splitter(" ");
>
> Nope, still devolves to string. It turns out it can't figure out how to
> split character ranges using a character array as input.
>
> The only thing that does seem to work is this:
>
> "abc 123".byCodeUnit.splitter(" ".byCodeUnit);
>
> But this goes against most algorithms in Phobos that deal with character
> ranges -- generally you can use any width character range, and it just
> works. Having a drop-in replacement for string would require splitter to
> handle these transcodings (and I think in general, algorithms should be
> able to handle them as well). Not only that, but the specialized
> splitter that takes no separator can split on multiple spaces, a feature
> I want to have for my drop-in replacement.
>
> I'll work on adding some issues to the tracker, and potentially doing
> some PRs so they can be fixed.

Well, plenty of algorithms don't care one whit about strings specifically, and thus their behavior is really dependent on what the element type of the range is (e.g. for byCodeUnit, filter would filter code units, and sort would sort code units, and arguably, that's what they should do). However, a big problem with a number of the functions in Phobos that specifically operate on ranges of characters is that they tend to assume that a range of characters means a range of dchar.

Some of the functions in Phobos have been fixed to be more flexible and operate on arbitrary ranges of char, wchar, or dchar, but it's mostly happened because of a bug report about a particular function not working with something like byCodeUnit. What we really need is to have tests added for all of the functions in Phobos which specifically operate on ranges of characters to ensure that they do the correct thing when given a range of char, wchar, dchar - or graphemes (much as we talk about graphemes being the correct level for some types of string processing, nothing in Phobos outside of std.uni currently does anything with byGrapheme, even in tests). And of course, with those tests, we'll inevitably find that a number of those functions won't work correctly and will need to be fixed.

But as annoying as all of that is, it's work that needs to be done regardless of the situation with auto-decoding, since these functions need to work with arbitrary ranges of characters and not just ranges of dchar. And those functions that don't need to try to avoid auto-decoding should then not even care whether strings are ranges of code units or code points, which should reduce the impact of auto-decoding. And actually, a lot of the code that specializes on narrow strings to avoid auto-decoding would probably work whether auto-decoding was there or not.
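
To make the element-type point above concrete, here is a minimal sketch (illustrative only, not code taken from Phobos; codeUnitCount and codePointCount are made-up helper names). It assumes the default behavior where auto-decoding is enabled, and it only shows how the same string looks to generic algorithms depending on whether it is passed directly or through byCodeUnit:

```d
import std.algorithm.comparison : equal;
import std.algorithm.iteration : filter;
import std.range.primitives : ElementType, walkLength;
import std.traits : Unqual;
import std.utf : byCodeUnit;

// Hypothetical helpers, named only for this example.
size_t codeUnitCount(string s) { return s.byCodeUnit.length; }
size_t codePointCount(string s) { return s.walkLength; }

void main()
{
    string s = "h\u00E9llo"; // U+00E9 takes two code units in UTF-8

    // With auto-decoding, the range primitives treat string as a range of dchar.
    static assert(is(ElementType!string == dchar));

    // byCodeUnit yields the individual char code units instead.
    static assert(is(Unqual!(ElementType!(typeof(s.byCodeUnit))) == char));

    // Six UTF-8 code units, but only five code points.
    assert(codeUnitCount(s) == 6);
    assert(codePointCount(s) == 5);

    // A generic algorithm such as filter just sees whatever the element
    // type is; here it drops the two non-ASCII code units of U+00E9.
    auto ascii = s.byCodeUnit.filter!(c => c < 0x80);
    assert(equal(ascii, "hllo".byCodeUnit));
}
```

The kind of test coverage discussed above would amount to running each character-oriented function over inputs like these (and their wchar, dchar, and byGrapheme counterparts) and checking that the results agree.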

So, once we've actually managed to ensure that Phobos in general works with arbitrary ranges of characters, the main breakage that would be caused by removing auto-decoding (in Phobos at least) would be any code that used strings with functions that weren't specifically written to do something special for strings. While I'm not at all convinced that we then have a path towards removing auto-decoding, it would minimize auto-decoding's impact, and with auto-decoding's impact minimized as much as possible, maybe at some point, we'll actually manage to figure out how to remove it.

But in any case, the issues that you're running into with splitter are a symptom of a larger problem with how Phobos currently handles ranges of characters. And when this sort of thing comes up, I'm reminded that I should take the time to start adding the appropriate tests to Phobos, and then I never get around to it - as with too many things. I really should fix that. :|

- Jonathan M Davis