On Wednesday, May 16, 2018 13:42:11 Walter Bright via Digitalmars-d wrote:
> On 5/16/2018 1:11 PM, Andrei Alexandrescu wrote:
> > If you could share some details on why you think UTF8 is badly designed
> > and how you believe it could be/have been better, I'd be in your debt!
>
> Me too. I think UTF-8 is brilliant (and I suffered for years under the
> lash of other multibyte encodings prior to UTF-8). Shift-JIS: shudder!
>
> Perhaps you're referring to the redundancy in UTF-8 - though illegal
> encodings are made possible by such redundancy.

I'm inclined to think that the redundancy is a serious flaw. I'd argue that if it were truly well-designed, there would be exactly one way to represent every character - all the way up to grapheme clusters, where multiple code points are involved (i.e. there would be no normalization issues in valid Unicode, because there would be only one valid normalization). But there may be some technical issues that I'm not aware of that would make that problematic.

Either way, the issues that I have with UTF-8 are issues that UTF-16 and UTF-32 have as well, since they're really issues relating to code points rather than to how those code points are encoded as bytes. Overall, I think that UTF-8 is by far the best encoding that we have, and I don't think that we're going to get anything better, but I'm also definitely inclined to think that it's still flawed - just far less flawed than the alternatives.

And in general, I have to wonder if there would be a way to make Unicode less complicated if we could do it from scratch without worrying about any kind of compatibility, since what we have is complicated enough that most programmers don't come close to understanding it, and it's just way too hard to get right. But I suspect that if efficiency matters, there's enough inherent complexity that we'd just be screwed on that front even if we could do a better job than was done with Unicode as we know it.

- Jonathan M Davis
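
For anyone following along, here is a minimal sketch of the two kinds of redundancy discussed above. It is not from the original post: it uses Python purely for brevity, and the helper names `same_text` and `is_valid_utf8` are made up for this example. The first half shows the code-point-level issue (two valid spellings of the same character that only compare equal after normalization), and the second shows the byte-level issue (an overlong UTF-8 sequence that a conforming decoder has to reject).

```python
import unicodedata

# Two valid Unicode spellings of the same user-perceived character "é".
precomposed = "\u00e9"   # U+00E9 LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # U+0065 followed by U+0301 COMBINING ACUTE ACCENT


def same_text(a: str, b: str) -> bool:
    """Hypothetical helper: compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


def is_valid_utf8(data: bytes) -> bool:
    """Hypothetical helper: True if the bytes decode as strict UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Code-point level: the raw strings differ, and so do their UTF-8 bytes,
# even though both are valid and render as the same character.
print(precomposed == decomposed)            # naive comparison
print(same_text(precomposed, decomposed))   # comparison after normalization
print(precomposed.encode("utf-8"), decomposed.encode("utf-8"))

# Byte level: C0 AF is an overlong two-byte spelling of "/" (U+002F).
# The bit layout of UTF-8 makes it expressible, so decoders must reject it.
print(is_valid_utf8(b"/"))
print(is_valid_utf8(b"\xc0\xaf"))
```

Note that the normalization problem would look exactly the same in UTF-16 or UTF-32, since it lives at the code point level, whereas the overlong-sequence problem is specific to how UTF-8 lays out its bytes.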
