On Saturday, 8 September 2018 at 15:36:25 UTC, Steven Schveighoffer wrote:
On 8/9/18 2:44 AM, Walter Bright wrote:


So it turns out that technically the problem here, even though it seemed like an autodecoding problem, is a problem with splitter.

splitter doesn't deal with encodings of character ranges at all.

For instance, when you have this:

"abc 123".byCodeUnit.splitter;

What happens is splitter only has one overload that takes one parameter, and that requires a character *array*, not a range.

So the byCodeUnit result is aliased-this to its original, and surprise! the elements from that splitter are string.

Next, I tried to use a parameter:

"abc 123".byCodeUnit.splitter(" ");

Nope, still devolves to string. It turns out it can't figure out how to split character ranges using a character array as input.

The only thing that does seem to work is this:

"abc 123".byCodeUnit.splitter(" ".byCodeUnit);
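
For anyone who wants to check this on their own compiler, here is a minimal sketch. It prints the element type of each variant instead of asserting it, since the result may well differ between Phobos releases. `splitCodeUnits` is a hypothetical helper name used only for this illustration, and the `__traits(compiles, ...)` guards are there on the assumption that the first two forms might not compile on every release.

```d
import std.algorithm.comparison : equal;
import std.algorithm.iteration : splitter;
import std.range.primitives : ElementType;
import std.stdio : writeln;
import std.utf : byCodeUnit;

// Hypothetical helper: the form reported to work above, where both the
// input and the separator are code-unit ranges.
auto splitCodeUnits(string s)
{
    return s.byCodeUnit.splitter(" ".byCodeUnit);
}

void main()
{
    // No separator argument. Guarded, because whether this compiles (and
    // what it yields) is assumed to depend on the Phobos version.
    static if (__traits(compiles, "abc 123".byCodeUnit.splitter()))
        writeln("no separator:        ",
            ElementType!(typeof("abc 123".byCodeUnit.splitter())).stringof);

    // Plain string separator, guarded for the same reason.
    static if (__traits(compiles, "abc 123".byCodeUnit.splitter(" ")))
        writeln("string separator:    ",
            ElementType!(typeof("abc 123".byCodeUnit.splitter(" "))).stringof);

    // Code-unit separator: print the element type rather than assume it.
    auto parts = splitCodeUnits("abc 123");
    writeln("code-unit separator: ", ElementType!(typeof(parts)).stringof);

    // Whatever the element type is, the contents should be the two words.
    assert(equal(parts.front, "abc"));
    parts.popFront();
    assert(equal(parts.front, "123"));
}
```

If the first two lines print `string` while the third prints something else, that matches the behaviour described above.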


After a while your code will be cluttered with absurd stuff like this: `.byCodeUnit`, `.byGrapheme`, `.array` etc. Because of my experience with `splitter` et al., I tried to create my own parser to have better control over every step. After a few *minutes* of testing things I ran into this bug [1], which didn't get fixed until early 2018. So I never wrote my own step-by-step parser, and I'm glad I didn't.

I wish people would begin to realize that string handling is a basic necessity and that the correct handling of strings is of utmost importance. Please keep us updated on how things work out (or not) for you.

(Please, nobody answer my post by pointing out that a) we don't understand Unicode and b) it's an insult to the Universe to draw attention to flaws that keep pestering us on an almost daily basis without trying to fix them ourselves stante pede. As is clear from Steve's efforts, the Universe doesn't seem to care.)

[1] https://issues.dlang.org/show_bug.cgi?id=16739

[snip]
