On Friday, May 27, 2016 16:41:09 Andrei Alexandrescu via Digitalmars-d wrote:
> On 05/27/2016 03:43 PM, H. S. Teoh via Digitalmars-d wrote:
> > That's what we've been trying to say all along!
>
> If that's the case things are pretty dire, autodecoding or not. -- Andrei
True enough. Correctly handling Unicode in the general case is ridiculously hard - especially if you want to be efficient. We could do everything at the grapheme level to get the correctness, but we'd be so slow that it would be ridiculous.

Fortunately, many string algorithms really don't need to care much about Unicode so long as the strings involved are normalized. For instance, a function like find can usually compare code units without decoding anything (though even then, depending on the normalization, you run the risk of finding part of a character if it involves combining code points - e.g. searching for e could give you the first part of é if it's encoded as the e followed by the combining accent).

But ultimately, fully correct string handling requires having a far better understanding of Unicode than most programmers have. Even the percentage of programmers here who have that level of understanding isn't all that great - though the fact that D supports UTF-8, UTF-16, and UTF-32 the way that it does has led a number of us to dig further into Unicode and learn it better in ways that we probably wouldn't have if all it had was char. It highlights that there is something that needs to be learned to get this right, in a way that most languages don't.

- Jonathan M Davis
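The find-vs-normalization pitfall above is easy to demonstrate. This sketch uses Python's unicodedata module rather than D, purely because the issue is language-agnostic - the same canonically equivalent string compares differently depending on whether é is one code point (NFC) or e plus a combining accent (NFD):

```python
import unicodedata

# "é" in two canonically equivalent encodings:
nfc = unicodedata.normalize("NFC", "é")  # one code point: U+00E9
nfd = unicodedata.normalize("NFD", "é")  # two: U+0065 "e" + U+0301 combining acute

# Code-unit-level comparison sees two different strings:
assert nfc != nfd

# A naive substring search for "e" "finds" the base character of "é"
# in the NFD form, but finds nothing in the NFC form:
assert nfd.find("e") == 0
assert nfc.find("e") == -1
```

This is exactly why comparing code units without decoding is only safe once both operands are known to be in the same normalization form.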
