On Saturday, 8 September 2018 at 15:36:25 UTC, Steven
Schveighoffer wrote:
> On 8/9/18 2:44 AM, Walter Bright wrote:
>
> So it turns out that technically the problem here, even though
> it seemed like an autodecoding problem, is a problem with
> splitter.
>
> splitter doesn't deal with encodings of character ranges at all.
>
> For instance, when you have this:
>
>     "abc 123".byCodeUnit.splitter;
>
> What happens is splitter only has one overload that takes one
> parameter, and that requires a character *array*, not a range.
>
> So the byCodeUnit result is aliased-this to its original, and
> surprise! the elements from that splitter are string.
>
> Next, I tried to use a parameter:
>
>     "abc 123".byCodeUnit.splitter(" ");
>
> Nope, still devolves to string. It turns out it can't figure
> out how to split character ranges using a character array as
> input.
>
> The only thing that does seem to work is this:
>
>     "abc 123".byCodeUnit.splitter(" ".byCodeUnit);

After a while your code will be cluttered with absurd stuff like
this: `.byCodeUnit`, `.byGrapheme`, `.array`, etc. Because of my
experience with `splitter` et al., I tried to create my own
parser to have better control over every step. After a few
*minutes* of testing things I ran into this bug [1], which didn't
get fixed until early 2018. In the end I never wrote my own
step-by-step parser. I'm glad I didn't.
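
For anyone who wants to try the one form reported to work in the quoted message, here is a minimal sketch. It assumes the usual Phobos module paths (`std.utf` for `byCodeUnit`, `std.algorithm.iteration` for `splitter`), and it prints the element type rather than asserting it, because what `splitter` returns for code-unit ranges may differ between Phobos releases.

```d
import std.algorithm.comparison : equal;
import std.algorithm.iteration : splitter;
import std.range.primitives : ElementType;
import std.stdio : writeln;
import std.utf : byCodeUnit;

void main()
{
    // Both the input and the separator are wrapped in byCodeUnit,
    // so neither side is handed to splitter as a plain string.
    auto parts = "abc 123".byCodeUnit.splitter(" ".byCodeUnit);

    // Show the element type instead of hard-coding an expectation;
    // this is the detail the quoted message is about.
    writeln(ElementType!(typeof(parts)).stringof);

    // Whatever the element type is, the pieces should compare equal,
    // element by element, to the expected substrings.
    assert(parts.equal!equal(["abc", "123"]));
}
```

Swapping in the other two calls from the quoted message and comparing the printed element types is a quick way to check how a given compiler release behaves.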
I wish people would realize that string handling is a basic
necessity and that the correct handling of strings is of utmost
importance. Please keep us updated on how things work out (or
not) for you.
[Please, nobody answer my post pointing out that a) we don't
understand Unicode and b) it's an insult to the Universe to
draw attention to flaws that keep pestering us on an almost daily
basis without trying to fix them ourselves stante pede. As is
clear from Steve's efforts, the Universe doesn't seem to care.]
[1] https://issues.dlang.org/show_bug.cgi?id=16739
[snip]