On Sunday, August 26, 2018 5:12:10 PM MDT FeepingCreature via Digitalmars-d wrote:
> On Sunday, 26 August 2018 at 22:44:05 UTC, Walter Bright wrote:
> > On 8/26/2018 8:43 AM, Chris wrote:
> >> I wanted to get rid of autodecode and I even offered to test
> >> it on my string heavy code to see what breaks (and maybe write
> >> guidelines for the transition), but somehow the whole idea of
> >> getting rid of autodecode was silently abandoned. What more
> >> could I do?
> >
> > It's not silently abandoned. It will break just about every D
> > program out there. I have a hard time with the idea that
> > breakage of old code is inexcusable, so let's break every old
> > program?
>
> Can I just throw in here that I like autodecoding and I think
> it's good?
> If you want ranges that iterate over bytes, then just use arrays
> of bytes. If you want Latin1 text, use Latin1 strings. If you
> want Unicode, you get Unicode iteration. This seems right and
> proper to me. Hell I'd love if the language was *more* aggressive
> about validating casts to strings.
The problem is that auto-decoding doesn't even give you correct Unicode handling. At best, it's kind of like using UTF-16 instead of ASCII while assuming that a UTF-16 code unit can always contain an entire character (which is frequently what you get in programs written in languages like Java or C#). A bunch more characters then work properly, but plenty of characters still don't. It's just a lot harder to realize it, because it's far from fail-fast.

In general, doing everything at the code point level with Unicode (as auto-decoding does) is very much broken. It's just that it's a lot less obvious, because that much more works - and it comes with the bonus of being far less efficient. If you wanted everything to "just work" out of the box without having to worry about Unicode, you could probably do it if everything operated at the grapheme cluster level, but that would be horribly inefficient.

The sad reality is that if you want your string-processing code to be at all fast while still being correct, you have to have at least a basic understanding of Unicode and use it correctly - and that rarely means doing much of anything at the code point level. It's much more likely that it needs to be at either the code unit or the grapheme level. But either way, without a programmer understanding the details and programming accordingly, the code is just plain going to be wrong somewhere. The idea that we can have string processing "just work" without the programmer having to worry about the details of Unicode is unfortunately largely a fallacy - at least if you care about efficiency.

By operating at the code point level, we're just generating code that looks like it works when it doesn't really, and it's less efficient. It certainly works in more cases than just using ASCII would, but it's still broken for Unicode handling, just as if the code assumed that char was always an entire character.

As such, I don't really see how there can be much defense for auto-decoding. It was done on the incorrect assumption that code points represent actual characters (for that, you actually need graphemes) and that the loss in speed was worth the correctness, with the idea that anyone wanting the speed could work around the auto-decoding. We could get something like that if we went to the grapheme level, but that would hurt performance that much more. Either way, operating at the code point level everywhere is just plain wrong. This isn't just a case of "it's annoying" or "we don't like it." It objectively results in incorrect code.

- Jonathan M Davis
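
A minimal D sketch of the code unit / code point / grapheme distinction described above, assuming Phobos' std.utf.byCodeUnit and std.uni.byGrapheme. The example string ("e" followed by U+0301, a combining acute accent) renders as a single user-perceived character, but it is two code points and three UTF-8 code units, so any code-point-level count or comparison of it is already wrong:

    import std.range : walkLength;
    import std.uni : byGrapheme;
    import std.utf : byCodeUnit;

    void main()
    {
        // "e" + U+0301 (combining acute accent): one visible character,
        // two code points, three UTF-8 code units.
        string s = "e\u0301";

        assert(s.byCodeUnit.walkLength == 3);  // code units (the chars actually in the array)
        assert(s.walkLength == 2);             // code points (what auto-decoding iterates)
        assert(s.byGrapheme.walkLength == 1);  // graphemes (user-perceived characters)
    }

Only the grapheme count matches what a user would call "a character"; the auto-decoded view in the middle is neither the raw representation nor the correct one, which is the point being made above.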