Re: Of possible interest: fast UTF8 validation

Jonathan M Davis via Digitalmars-d Wed, 16 May 2018 14:12:22 -0700

On Wednesday, May 16, 2018 13:42:11 Walter Bright via Digitalmars-d wrote:
> On 5/16/2018 1:11 PM, Andrei Alexandrescu wrote:
> > If you could share some details on why you think UTF8 is badly designed
> > and how you believe it could be/have been better, I'd be in your debt!
>
> Me too. I think UTF-8 is brilliant (and I suffered for years under the
> lash of other multibyte encodings prior to UTF-8). Shift-JIS: shudder!
>
> Perhaps you're referring to the redundancy in UTF-8 - though illegal
> encodings are made possible by such redundancy.


I'm inclined to think that the redundancy is a serious flaw. I'd argue that
if it were truly well-designed, there would be exactly one way to represent
every character - all the way up to grapheme clusters where multiple
code points are involved (i.e. there would be no normalization issues in
valid Unicode, because there would be only one valid normalization). But
there may be some technical issues that I'm not aware of that would make
that problematic. Either way, the issues that I have with UTF-8 are issues
that UTF-16 and UTF-32 have as well, since they're really issues relating to
code points.
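
To make the two kinds of redundancy concrete, here is a minimal sketch in
Python (chosen only for illustration; the helper names same_text and
is_valid_utf8 are hypothetical, not from any library). It shows one grapheme
cluster with two valid code point spellings, which is the normalization issue
and is independent of the encoding, and an overlong byte sequence, which is
the UTF-8-level redundancy that a strict decoder is expected to reject.

```python
import unicodedata

# Two code point sequences that render as the same grapheme cluster "é".
precomposed = "\u00e9"   # U+00E9 LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # U+0065 followed by U+0301 COMBINING ACUTE ACCENT

# They are different code point sequences, so they differ in every encoding.
utf8_forms = (precomposed.encode("utf-8"), decomposed.encode("utf-8"))
utf32_forms = (precomposed.encode("utf-32-be"), decomposed.encode("utf-32-be"))


def same_text(a: str, b: str) -> bool:
    """Hypothetical helper: compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


def is_valid_utf8(data: bytes) -> bool:
    """Hypothetical helper: True if Python's strict decoder accepts the bytes."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Overlong two-byte spelling of "/" (U+002F); the only valid form is b"/".
overlong_slash = b"\xc0\xaf"

if __name__ == "__main__":
    print(precomposed == decomposed)          # raw comparison
    print(same_text(precomposed, decomposed)) # comparison after NFC
    print(utf8_forms, utf32_forms)
    print(is_valid_utf8(b"/"), is_valid_utf8(overlong_slash))
```

The raw comparison fails while the normalized one succeeds, which is exactly
the trap: code that compares strings without normalizing first treats the two
spellings as different text, whichever UTF encoding is in use.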

Overall, I think that UTF-8 is by far the best encoding that we have, and I
don't think that we're going to get anything better, but I'm also definitely
inclined to think that it's still flawed - just far less flawed than the
alternatives.

And in general, I have to wonder if there would be a way to make Unicode
less complicated if we could do it from scratch without worrying about any
kind of compatibility, since what we have is complicated enough that most
programmers don't come close to understanding it, and it's just way too hard
to get right. But I suspect that if efficiency matters, there's enough
inherent complexity that we'd just be screwed on that front even if we could
do a better job than was done with Unicode as we know it.

- Jonathan M Davis
