On Sun, Aug 26, 2018 at 11:12:10PM +0000, FeepingCreature via Digitalmars-d wrote:
[...]
> Can I just throw in here that I like autodecoding and I think it's
> good? If you want ranges that iterate over bytes, then just use
> arrays of bytes. If you want Latin1 text, use Latin1 strings. If you
> want Unicode, you get Unicode iteration. This seems right and proper
> to me. Hell I'd love if the language was *more* aggressive about
> validating casts to strings.
Actually, this is exactly the point that makes autodecoding so bad: it *looks like* correct Unicode iteration over characters, but it actually isn't. It's iteration over Unicode *code points*, which is not the same thing as iteration over what people would think of as "characters", which in Unicode are called graphemes (cf. byGrapheme). So iterating over a string like "a\u0301" will give you two code points, even though it renders as a single grapheme (see the first sketch below).

Unfortunately, most of the time the iteration will look correct -- at least for most European languages -- so the programmer will suspect nothing is wrong. Until the code is handed a non-European Unicode string; then it starts producing wrong behaviour.

Not to mention that this incomplete solution represents an across-the-board performance hit on all string-processing code (unless it was explicitly written to bypass autodecoding with something like byCodeUnit), even when the code in question doesn't care about Unicode at all and treats the strings as opaque byte sequences.

The illusion of simplicity and correctness that autodecoding gives is misleading, and makes programmers think their code is OK, when the fact of the matter is that to handle Unicode correctly, you *have* to actually know what Unicode is and how it works. You simply cannot pretend that it bears any resemblance to the ASCII days of one code unit per character (no, not even with UTF-32) and expect your code to behave correctly with all valid Unicode input strings.

In fact, this very illusion was what made Andrei choose to go with autodecoding in the first place, thinking that it would default to correct behaviour. Unfortunately, the reality didn't match up with that expectation.

The ideal solution would have been to make strings non-iterable by default, and only iterable when the programmer chooses the mode of iteration (explicitly specifying byCodeUnit, byCodePoint, or byGrapheme). A rough sketch of what that could look like follows the first example below.
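To make the "a\u0301" point concrete, here's a minimal sketch (just counting elements with Phobos's walkLength) showing how the three levels of iteration disagree about the "length" of that one string:

    import std.range : walkLength;
    import std.stdio : writeln;
    import std.uni : byGrapheme;
    import std.utf : byCodeUnit;

    void main()
    {
        string s = "a\u0301"; // 'a' + U+0301 COMBINING ACUTE ACCENT

        // Autodecoding: the range primitives decode the string into dchars,
        // so generic range code sees code points.
        writeln(s.walkLength);            // 2 code points

        // The raw UTF-8 code units underneath (same thing s.length counts).
        writeln(s.byCodeUnit.walkLength); // 3 code units

        // What a human reader would call one "character".
        writeln(s.byGrapheme.walkLength); // 1 grapheme
    }

Three different answers for the same string, and the autodecoded one is neither the cheapest (code units) nor the "correct" one for rendering (graphemes).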
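And for that last point, a rough sketch of what a non-iterable string type could look like (ExplicitString and its method names are hypothetical illustrations, not a worked-out proposal):

    import std.uni : byGrapheme;
    import std.utf : byCodeUnit, byUTF;

    // Deliberately not an input range: no front/popFront/empty and no
    // opApply, so plain foreach over it fails to compile and the caller
    // must pick an iteration mode explicitly.
    struct ExplicitString
    {
        string data;

        auto codeUnits()  { return data.byCodeUnit; }  // raw immutable(char)s
        auto codePoints() { return data.byUTF!dchar; } // decoded dchars
        auto graphemes()  { return data.byGrapheme; }  // user-perceived chars
    }

    void main()
    {
        auto s = ExplicitString("a\u0301");
        //foreach (c; s) {}          // error: no iteration mode chosen
        foreach (g; s.graphemes) {}  // OK: explicit choice
    }

The point is just that the compile error forces the programmer to make the decision that autodecoding currently makes for them, silently and often wrongly.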
T

--
What do you call optometrist jokes? Vitreous humor.