On 3/8/14, 9:33 AM, Sean Kelly wrote:
On Saturday, 8 March 2014 at 00:22:05 UTC, Walter Bright wrote:
Andrei suggests that this change would destroy D by breaking too much
existing code. He might be right. Can we afford the risk that he is
right?

Perhaps not. But I think the current approach is totally broken, it
just also happens to be what people have coded to.

I think that's an exaggeration poorly supported by evidence. My definition of "totally broken" would be "essentially unusable" and I think we're well past the point where we need to prove that. Virtually all applications need to deal with strings to some extent, and I myself wrote a couple of relatively string-heavy ones. D strings work well. Even the most ardent detractors of D on e.g. reddit.com admit by omission that string processing is one of its strengths. Though they'll probably pick up on this thread soon :o).

Andrei used
algorithms operating on a code point level as an example of what would
break if this change were made, and in that he's absolutely correct.
But what bothers me is whether it's appropriate to perform this sort of
character-based operation on Unicode strings in the first place.

Searching for characters in strings would be difficult to deem inappropriate.

When I designed std.algorithm I recall I put the following options on the table:

1. All algorithms would by default operate on strings at char/wchar level (i.e. code unit). That would cause the usual issues and confusions I was aware of from C++. Certain algorithms would require specialization and/or the user using byDchar for correctness. At some point I swear I had a byDchar definition somewhere; I've searched the recent git history for it, to no avail.

2. All algorithms would by default operate at code point level. That way correctness would be achieved by default, and certain algorithms would require specialization for efficiency. (Back then I didn't know about graphemes and normalization. I'm not sure how that would have affected the final decision.)

3. Change the alias string, wstring etc. to be some type that requires explicit access for code units/code points etc. instead of implicitly mixing the two.
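To make the difference between options 1 and 2 concrete, here is a minimal sketch of the two views on one string. The sample text and the helper names countCodeUnits and countCodePoints are made up for illustration; walkLength, countUntil and std.string.indexOf are existing Phobos functions.

```d
import std.algorithm : countUntil;
import std.range : walkLength;
import std.string : indexOf;

// Hypothetical helper: option 1 view, counting UTF-8 code units.
size_t countCodeUnits(string s)
{
    size_t n;
    foreach (char c; s) ++n; // iterating by char does no decoding
    return n;
}

// Hypothetical helper: option 2 view, counting decoded code points.
size_t countCodePoints(string s)
{
    return s.walkLength; // range primitives decode a string to dchar
}

void main()
{
    string s = "h\u00E9llo"; // U+00E9 takes two UTF-8 code units

    assert(s.length == 6);
    assert(countCodeUnits(s) == 6);
    assert(countCodePoints(s) == 5);

    // countUntil counts range elements, i.e. code points.
    assert(s.countUntil('l') == 2);
    // indexOf yields an index usable for slicing, i.e. code units.
    assert(s.indexOf('l') == 3);
    assert(s[3 .. $] == "llo");
}
```

The two search results differ by one because the accented letter is one code point but two code units; only the indexOf result can be used directly to slice the string.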

My fave was (3). And not mine only - several people suggested alternative definitions of the "default" string type. Back then however we were in the middle of the D1/D2 transition and one more aftershock didn't seem like a good idea at all. Walter opposed such a change, and didn't really have to convince me.

From experience with C++ I knew (1) had a bad track record, and (2) "generically conservative, specialize for speed" was a successful pattern.

What would you have chosen given that context?

The current approach is a cut above treating strings as arrays of bytes
for some languages, and still utterly broken for others. If I'm
operating on a right to left language like Hebrew, what would I expect
the result to be from something like countUntil?

The entire string processing paraphernalia is left to right. I figure RTL languages are under-supported, but s.retro.countUntil comes to mind.
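A rough sketch of what I mean, assuming the Hebrew text is stored in logical order (first letter read comes first in memory), as Unicode text normally is. The helper name countFromEnd and the sample word are only for illustration.

```d
import std.algorithm : countUntil;
import std.range : retro;

// Hypothetical helper: number of code points between the end of s
// and the last occurrence of c, or -1 if c does not occur.
ptrdiff_t countFromEnd(string s, dchar c)
{
    return s.retro.countUntil(c);
}

void main()
{
    // "shalom": shin, lamed, vav, final mem, in logical order.
    string s = "\u05E9\u05DC\u05D5\u05DD";
    dchar shin = '\u05E9';
    dchar finalMem = '\u05DD';

    // Counting from the logical start of the text.
    assert(s.countUntil(shin) == 0);
    assert(s.countUntil(finalMem) == 3);

    // Counting from the logical end of the text.
    assert(countFromEnd(s, finalMem) == 0);
    assert(countFromEnd(s, shin) == 3);
    assert(countFromEnd(s, 'x') == -1);
}
```

Both counts are in code points and follow storage order, not the order in which the letters are displayed on screen.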

And how useful would
such a result be?

I don't know.

I'm inclined to say that the correct approach is to
state that algorithms operate explicitly on a T.sizeof basis and that if
the data contained in a particular range has some multi-element encoding
then separate, specialized routines should be used where the T.sizeof
behavior will not produce the desired result.

That sounds quite like C++ plus ICU. It doesn't strike me as the gold standard for Unicode integration.

So the problem to me is that we're stuck not fixing something that's
horribly broken just because it's broken in a way that people presumably
now expect.

Clearly I'm being subjective here, but again I'd find it difficult to be convinced we have something horribly broken, given the evidence I gathered inside and outside Facebook.

I'd personally like to see this fixed and I think the new behavior is
preferable overall, but I do share Andrei's concern that such a big
change might hurt the language anyway.

I've said this once and I'm saying it again: the best way to convert this discussion into something useful is to devise ideas for useful non-breaking additions.


Andrei
