On Monday, September 10, 2018 2:45:27 AM MDT Chris via Digitalmars-d wrote:
> After a while your code will be cluttered with absurd stuff like
> this. `.byCodeUnit`, `.byGrapheme`, `.array` etc. Due to my
> experience with `splitter` et. al. I tried to create my own
> parser to have better control over every step. After a few
> *minutes* of testing things I ran into this bug [1] that didn't
> get fixed till early 2018. I never started to write my own
> step-by-step parser. I'm glad I didn't.
>
> [1] https://issues.dlang.org/show_bug.cgi?id=16739
>
> [snip]

I suspect that that didn't get found sooner simply because using Unicode in a switch statement is rare. Usually, Unicode characters are found in program input and not in the program itself. And grammars typically only involve ASCII characters (even D, which supports Unicode characters in identifiers, doesn't have any Unicode in any of its symbols). So, while I completely agree that using Unicode in switch statements should work, it doesn't really surprise me that it was broken.

That's really a large part of the Unicode problem. Regardless of how a particular language or library attempts to make using Unicode sane, a large percentage of programmers don't ever do anything with Unicode characters (even if their programs are often used in environments where they will end up processing Unicode characters), and even when a programmer's native tongue requires Unicode characters, their programs frequently do not. So, it becomes very easy to write code that doesn't work properly with Unicode and have no clue that it doesn't.

Fortunately, D does provide better tools than many languages for handling Unicode, but the auto-decoding mess has made it considerably worse. Still, even if we'd gotten it right, some portion of the code out there would have to have something like byCodeUnit, byCodePoint, or byGrapheme, because efficient Unicode processing requires that you deal with all of that mess. The code that doesn't have to do any of that is generally code that treats strings as opaque data.
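
To make those levels concrete, here is a minimal sketch of the same string counted three ways. The helper names (`codeUnitCount`, `codePointCount`, `graphemeCount`) and the sample string are made up for illustration and are not part of Phobos; the calls they wrap are `std.utf.byCodeUnit`, `std.uni.byGrapheme`, and `std.range.walkLength`.

```d
import std.range : walkLength;
import std.stdio : writeln;
import std.uni : byGrapheme;
import std.utf : byCodeUnit;

// Hypothetical helper: number of UTF-8 code units (no decoding).
size_t codeUnitCount(string s)
{
    return s.byCodeUnit.walkLength;
}

// Hypothetical helper: number of code points. Treating a string as a
// range auto-decodes it to dchar, so walkLength counts code points.
size_t codePointCount(string s)
{
    return s.walkLength;
}

// Hypothetical helper: number of graphemes (user-perceived characters).
size_t graphemeCount(string s)
{
    return s.byGrapheme.walkLength;
}

void main()
{
    // Illustrative input: 'e' followed by U+0301 (combining acute accent).
    string s = "e\u0301";

    writeln("code units:  ", codeUnitCount(s));
    writeln("code points: ", codePointCount(s));
    writeln("graphemes:   ", graphemeCount(s));
}
```

For this input the three counts differ: 'e' is one UTF-8 code unit and U+0301 is two, the string holds two code points, and the pair forms a single grapheme. The code unit count needs no decoding at all, whereas the grapheme count has to decode and segment the whole string, which is where the efficiency cost comes from.
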
Once you actually have to do string processing, you're pretty much screwed. Doing everything at the grapheme level would eliminate most of the problems with regard to user-friendliness, but it would kill efficiency. So, as far as I can tell, there really isn't a great solution to be had. Unicode is simply too complicated and messy by its very nature.

Now, we've definitely made mistakes with Phobos that make it worse, but the only programs that are going to avoid this whole mess do so by not dealing with Unicode, by handling it incorrectly, or by handling it inefficiently. I think that it's pretty much a pipe dream to be able to have completely sane and efficient string handling using Unicode as it's currently defined. Regardless, we need to do a better job of it in D than we have been.

- Jonathan M Davis