On Thursday, 26 May 2016 at 16:00:54 UTC, Andrei Alexandrescu wrote:
instead, it should use standard library algorithms for searching, matching etc. When needed, iterating every code unit is trivially
done through indexing.

For an example where the std.algorithm/range functions don't cut it, my random format date string parser first breaks up the given character range into tokens. Once it has the tokens, it checks them against several known formats. One piece of that is testing whether some of the tokens are keys in AAs of month and day names, which makes the presence check fast. Because the AAs are int[string] and the encoding of the incoming character range isn't known in advance (it's complicated), the tokenizer has to force every input where isSomeString!R == true to UTF-8 with byChar. That avoids the auto-decoding and the AA key mismatch that would follow.
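A minimal sketch of the byChar step described above, not the actual parser code. The helper name monthNumber and the three-entry table are hypothetical, made up only for illustration; byChar, ElementType, isInputRange, isSomeChar and Unqual are the real Phobos names.

```d
import std.range.primitives : ElementType, isInputRange;
import std.traits : isSomeChar, Unqual;
import std.utf : byChar;

// Hypothetical helper (not from the parser discussed above): returns the
// month number for a token, or 0 if the token is not a known month name.
int monthNumber(R)(R token)
    if (isInputRange!R && isSomeChar!(ElementType!R))
{
    // Illustrative table only; a real parser would build it once.
    int[string] months = ["jan": 1, "feb": 2, "mar": 3];

    // byChar presents any character range as UTF-8 code units, so the key
    // is the same whether the input was string, wstring or dstring.
    char[] key;
    foreach (c; token.byChar)
        key ~= c;

    if (auto p = key.idup in months)
        return *p;
    return 0;
}

void main()
{
    // Range primitives auto-decode narrow strings to dchar ...
    static assert(is(ElementType!string == dchar));
    // ... while byChar yields char, even for a wstring input.
    static assert(is(Unqual!(ElementType!(typeof("feb"w.byChar))) == char));

    assert(monthNumber("feb") == 2);
    assert(monthNumber("feb"w) == 2);
    assert(monthNumber("feb"d) == 2);
    assert(monthNumber("xyz") == 0);
}
```

The point of the sketch is only that the key type is pinned to UTF-8 before the lookup, so tokens produced from any string type match the same int[string] table.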

Agreed. This is probably the most glaring mistake. I think we should open a discussion on fixing this everywhere in the stdlib, even at the cost of breaking code.

See the discussion here: https://issues.dlang.org/show_bug.cgi?id=14519

I think some of the proposals there are interesting.

Overall, I think the one way to make real steps forward in improving string processing in the D language is to give a clear answer of what char, wchar, and dchar mean.

If you agree that iterating over code units and code points isn't what people want/need most of the time, then I will quote something from my article on the subject:

"I really don't see the benefit of the automatic behavior fulfilling this one specific corner case when you're going to make everyone else call a range generating function when they want to iterate over code units or graphemes. Just make everyone call a range generating function to specify the type of iteration and save a lot of people the trouble!"

I think the only clear way forward is to not make strings ranges and force people to make a decision when passing them to range functions. The HUGE problem is the code this will break, which is just about all of it.
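To make the "force people to make a decision" idea concrete, here is a small sketch using range-generating functions that already exist in Phobos (byCodeUnit and byDchar from std.utf, byGrapheme from std.uni). It only illustrates what an explicit choice looks like at the call site; it is not a proposal for how the change would actually be implemented.

```d
import std.range.primitives : walkLength;
import std.uni : byGrapheme;
import std.utf : byCodeUnit, byDchar;

void main()
{
    // 'e' followed by U+0301 COMBINING ACUTE ACCENT:
    // a single user-perceived character.
    string s = "e\u0301";

    // The caller states the unit of iteration explicitly.
    assert(s.byCodeUnit.walkLength == 3); // UTF-8 code units
    assert(s.byDchar.walkLength == 2);    // code points
    assert(s.byGrapheme.walkLength == 1); // graphemes

    // Today's implicit behavior: range functions auto-decode to code points.
    assert(s.walkLength == 2);
}
```

The three explicit calls give three different answers for the same string, which is why the implicit default is only correct for one of them.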
