Re: Of possible interest: fast UTF8 validation

Joakim via Digitalmars-d Wed, 16 May 2018 22:05:45 -0700

On Wednesday, 16 May 2018 at 20:11:35 UTC, Andrei Alexandrescu wrote:

On 5/16/18 1:18 PM, Joakim wrote:
On Wednesday, 16 May 2018 at 16:48:28 UTC, Dmitry Olshansky wrote:
On Wednesday, 16 May 2018 at 15:48:09 UTC, Joakim wrote:
On Wednesday, 16 May 2018 at 11:18:54 UTC, Andrei Alexandrescu wrote:
https://www.reddit.com/r/programming/comments/8js69n/validating_utf8_strings_using_as_little_as_07/
Sigh, this reminds me of the old quote about people spending a bunch of time making more efficient what shouldn't be done at all.
Validating UTF-8 is super common, most text protocols and files these days would use it, others would have an option to do so.
I’d like our validateUtf to be fast, since right now we do validation every time we decode a string. And THAT is slow. Trying to not validate on decode means most things should be validated on input...
I think you know what I'm referring to, which is that UTF-8 is a badly designed format, not that input validation shouldn't be done.
I find this an interesting minority opinion, at least from the perspective of the circles I frequent, where UTF8 is unanimously heralded as a great design. Only a couple of weeks ago I saw Dylan Beattie give a very entertaining talk on exactly this topic: https://dotnext-piter.ru/en/2018/spb/talks/2rioyakmuakcak0euk0ww8/

Thanks for the link, skipped to the part about text encodings, should be fun to read the rest later.

If you could share some details on why you think UTF8 is badly designed and how you believe it could be/have been better, I'd be in your debt!

Unicode was a standardization of all the existing code pages and then added these new transfer formats, but I have long thought that they'd have been better off going with a header-based format that kept most languages in a single-byte scheme, as they mostly were, except, obviously, for the Asian CJK languages. That way, you optimize for the common string, i.e. one that contains a single language or at least no CJK, rather than pessimizing every non-ASCII language by doubling its character width, as UTF-8 does. This UTF-8 issue is one of the first topics I raised in this forum, but as you noted at the time, nobody agreed, and I don't want to dredge that all up again.
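To make the width point concrete, here is a minimal Python sketch comparing the encoded size of a short Russian string in UTF-8 and in a legacy single-byte code page. The helper name `encoded_sizes` and the choice of `cp1251` are just for illustration; they are not part of any proposed format.

```python
def encoded_sizes(text, legacy_codec):
    """Return (utf8_size, legacy_size) in bytes for the same text."""
    return len(text.encode("utf-8")), len(text.encode(legacy_codec))


# 11 characters: 9 Cyrillic letters plus an ASCII comma and space.
sample = "Привет, мир"

utf8_size, legacy_size = encoded_sizes(sample, "cp1251")

# Each Cyrillic letter takes 2 bytes in UTF-8 but 1 byte in cp1251;
# the ASCII comma and space take 1 byte in both.
print(utf8_size, legacy_size)
```

For pure ASCII input the two sizes are identical, so the difference only shows up for non-ASCII scripts.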

I have been researching this a bit since then, and the stated goals for UTF-8 at inception were that it _could not overlap with ASCII anywhere for other languages_, to avoid issues with legacy software wrongly processing other languages as ASCII, and to allow seeking from an arbitrary location within a byte stream:


https://www.cl.cam.ac.uk/~mgk25/ucs/utf-8-history.txt
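Both properties can be seen in a few lines of Python. This is only a sketch, and the helper names `is_continuation` and `next_boundary` are made up for illustration: every byte of a non-ASCII character has its high bit set, and continuation bytes have the bit pattern 10xxxxxx, so a reader dropped at an arbitrary offset can skip forward to the next character start.

```python
def is_continuation(byte):
    """True for UTF-8 continuation bytes (bit pattern 10xxxxxx)."""
    return (byte & 0xC0) == 0x80


def next_boundary(data, pos):
    """Return the first index >= pos that starts a character, or len(data)."""
    while pos < len(data) and is_continuation(data[pos]):
        pos += 1
    return pos


# 'a' is 1 byte, 'π' is 2 bytes, '€' is 3 bytes, 'z' is 1 byte: 7 bytes total.
data = "aπ€z".encode("utf-8")

# Offset 4 lands in the middle of '€'; resynchronize and decode the rest.
start = next_boundary(data, 4)
print(start, data[start:].decode("utf-8"))

# No byte of a non-ASCII character falls in the ASCII range 0x00-0x7F.
print(all(b >= 0x80 for b in "π€".encode("utf-8")))
```

Note that this only finds a boundary in input that is already valid UTF-8; it does not validate anything.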

I have no dispute with these priorities at the time, as they were optimizing for the institutional and tech realities of 1992 as Dylan also notes, and UTF-8 is actually a nice hack given those constraints. What I question is that those priorities are at all relevant today, when billions of smartphone users are regularly not using ASCII, and these tech companies are the largest private organizations on the planet, i.e. they have the resources to design a new transfer format. I see basically no relevance for the streaming requirement today, as I noted in this forum years ago, but I can see why it might have been considered important in the early '90s, before packet-based networking protocols had won.

I think a header-based scheme would be _much_ better today and the reason I know Dmitry knows that is that I have discussed privately with him over email that I plan to prototype a format like that in D. Even if UTF-8 is already fairly widespread, something like that could be useful as a better intermediate format for string processing, and maybe someday could replace UTF-8 too.
