On Thursday, 17 May 2018 at 05:01:54 UTC, Joakim wrote:
On Wednesday, 16 May 2018 at 20:11:35 UTC, Andrei Alexandrescu
On 5/16/18 1:18 PM, Joakim wrote:
On Wednesday, 16 May 2018 at 16:48:28 UTC, Dmitry Olshansky
On Wednesday, 16 May 2018 at 15:48:09 UTC, Joakim wrote:
On Wednesday, 16 May 2018 at 11:18:54 UTC, Andrei
Sigh, this reminds me of the old quote about people
spending a bunch of time making more efficient what
shouldn't be done at all.
Validating UTF-8 is super common: most text protocols and
files these days use it, and others would have an option to
I’d like our validateUtf to be fast, since right now we do
validation every time we decode a string. And THAT is slow.
Trying to not validate on decode means most things should be
validated on input...
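To make the "validate once on input, then trust it" idea concrete, here is a minimal sketch in Python rather than D. The names `validate_utf8` and `read_validated` are hypothetical and only illustrate the pattern; this is not Phobos' validateUtf.

```python
def validate_utf8(data: bytes) -> bool:
    """Return True if data is well-formed UTF-8 (strict decoding succeeds)."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


def read_validated(data: bytes) -> bytes:
    """Hypothetical input boundary: validate once, then hand the bytes on.

    Code downstream of this function may assume the bytes are valid UTF-8
    and does not need to re-validate on every decode.
    """
    if not validate_utf8(data):
        raise ValueError("input is not valid UTF-8")
    return data


if __name__ == "__main__":
    print(validate_utf8("héllo".encode("utf-8")))  # well-formed
    print(validate_utf8(b"\xc3"))                  # truncated sequence
    print(validate_utf8(b"\xc0\x80"))              # overlong encoding
```

The design point is only that the cost is paid once at the boundary instead of on every decode.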
I think you know what I'm referring to, which is that UTF-8
is a badly designed format, not that input validation
shouldn't be done.
I find this an interesting minority opinion, at least from the
perspective of the circles I frequent, where UTF8 is
unanimously heralded as a great design. Only a couple of weeks
ago I saw Dylan Beattie give a very entertaining talk on
exactly this topic:
Thanks for the link, skipped to the part about text encodings,
should be fun to read the rest later.
If you could share some details on why you think UTF8 is badly
designed and how you believe it could be/have been better, I'd
be in your debt!
Unicode was a standardization of all the existing code pages
and then added these new transfer formats, but I have long
thought that they'd have been better off going with a
header-based format that kept most languages in a single-byte
This is not practical, sorry. What happens when your message
loses the header? Exactly, the rest of the message is garbled.
That's exactly what happened with code page based texts when you
don't know in which code page it is encoded. It has the
supplemental inconvenience that mixing languages becomes
impossible or at least very cumbersome.
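A small sketch of the lost-header problem, in Python. The sample word and the choice of ISO 8859-7 and Windows-1251 as the two code pages are my own, picked only for illustration.

```python
# The same bytes, read under two different code pages.
greek = "Καλημέρα"

# Encode with the Greek code page: one byte per letter.
raw = greek.encode("iso8859_7")

# With the right "header" (code page) the text comes back intact.
right = raw.decode("iso8859_7")

# With the header lost and a Cyrillic code page guessed instead,
# decoding still succeeds, but silently yields different text.
wrong = raw.decode("cp1251")

if __name__ == "__main__":
    print(right == greek)   # correct code page round-trips
    print(wrong == greek)   # wrong code page does not
    print(wrong)            # Cyrillic letters instead of Greek
```

Nothing in the bytes themselves says which reading is correct, which is the point being made above.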
UTF-8 has several properties that are difficult to have with
- It is stateless, meaning any byte in a stream always means the
same thing. Its meaning does not depend on external or a
- It can mix any language in the same stream without acrobatics,
and anyone who thinks that mixing languages doesn't happen often
should get his head extracted from his rear, because it is very
common (check Wikipedia's front page, for example).
- The multi-byte nature of other alphabets is not as bad as
people think because texts on computers do not live on their own,
meaning that they are generally embedded inside file formats,
which more often than not are extremely bloated (xml, html,
xliff, akoma ntoso, rtf etc.). The few bytes more in the text do
not weigh that much.
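The stateless property in the list above can be sketched in a few lines of Python. The helper name `next_boundary` and the sample strings are made up for the example; the bit patterns are the standard UTF-8 ones (continuation bytes look like 10xxxxxx).

```python
def next_boundary(data: bytes, pos: int) -> int:
    """Return the first index >= pos that starts a code point.

    Continuation bytes have the form 0b10xxxxxx, so we skip them.
    No state from earlier in the stream is needed.
    """
    while pos < len(data) and (data[pos] & 0xC0) == 0x80:
        pos += 1
    return pos


# One stream mixing Latin, Greek and Chinese, with no mode switches.
mixed = "aπ亿b"
data = mixed.encode("utf-8")  # 1 + 2 + 3 + 1 bytes


def has_no_ascii_bytes(ch: str) -> bool:
    """True if no byte of ch's UTF-8 encoding falls in the ASCII range."""
    return all(b >= 0x80 for b in ch.encode("utf-8"))


if __name__ == "__main__":
    # Landing in the middle of "π" (index 2) resynchronizes at "亿" (index 3).
    start = next_boundary(data, 2)
    print(start, data[start:start + 3].decode("utf-8"))
    print(has_no_ascii_bytes("亿"))
```

Dropping into the stream at any byte offset costs at most a few bytes of skipping before decoding can resume.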
I'm in charge at the European Commission of the biggest
translation memory in the world. It currently handles 30
languages and without UTF-8 and UTF-16 it would be unmanageable.
I still remember starting there in 2002, when we handled
only 11 languages, of which only one used another alphabet
(Greek). Everything was based on RTF with codepages and it was a
braindead mess. My first job in 2003 was to extend the system to
handle the 8 newcomer languages, and with ASCII-based encodings it
was completely unmanageable because every document processed
mixes languages and alphabets freely (addresses and names are
often written in their original form, for instance).
Two years ago we also implemented support for Chinese. The nice
thing was that we didn't have to change much to do that, thanks to
Unicode. The second surprise was the file sizes: Chinese
documents were generally smaller than their European
counterparts. Yes, CJK requires 3 bytes for each ideogram, but
generally 1 ideogram replaces many letters. The ideogram 亿
replaces "One hundred million", for example; which of them takes
more bytes? So if CJK indeed requires more bytes to encode, it is
because it NEEDS many more bits in the first place
(there are around 30000 CJK codepoints in the BMP alone; add to
that the 60000 that are in the SIP and we need 17 bits
just to encode them).
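The arithmetic in that paragraph is easy to check. A quick Python sketch, using only the figures quoted above (the 30000 and 60000 counts are the poster's round numbers, not exact Unicode totals):

```python
ideogram = "亿"
english = "One hundred million"

# UTF-8 sizes of the two ways of writing the same quantity.
ideogram_bytes = len(ideogram.encode("utf-8"))
english_bytes = len(english.encode("utf-8"))

# Bits needed to give each of roughly 90000 code points its own number:
# the largest index is 89999, and its bit length is the answer.
cjk_codepoints = 30000 + 60000
bits_needed = (cjk_codepoints - 1).bit_length()

if __name__ == "__main__":
    print(ideogram_bytes, english_bytes)  # 3 vs 19
    print(bits_needed)                    # 17
```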
as they mostly were, except obviously for the Asian CJK
languages. That way, you optimize for the common string, i.e. one
that contains a single language or at least no CJK, rather than
pessimizing every non-ASCII language by doubling its character
width, as UTF-8 does. This UTF-8 issue is one of the first
topics I raised in this forum, but as you noted at the time
nobody agreed and I don't want to dredge that all up again.
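For reference, the "doubling" being described is easy to measure. A sketch in Python comparing a single-byte Greek code page with UTF-8; the sample word and the choice of ISO 8859-7 are mine, for illustration only.

```python
word = "Ελληνικά"  # 8 Greek letters

# Single-byte code page: one byte per letter.
codepage_size = len(word.encode("iso8859_7"))

# UTF-8: every Greek letter takes two bytes.
utf8_size = len(word.encode("utf-8"))

# ASCII text is unaffected: one byte per letter either way.
ascii_size = len("Greek".encode("utf-8"))

if __name__ == "__main__":
    print(codepage_size, utf8_size)  # 8 vs 16
    print(ascii_size)                # 5
```

This only measures the raw text; it says nothing about the size of whatever file format the text is embedded in.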
I have been researching this a bit since then, and the stated
goals for UTF-8 at inception were that it _could not overlap
with ASCII anywhere for other languages_, to avoid issues with
legacy software wrongly processing other languages as ASCII,
and to allow seeking from an arbitrary location within a byte
I have no dispute with these priorities at the time, as they
were optimizing for the institutional and tech realities of
1992 as Dylan also notes, and UTF-8 is actually a nice hack
given those constraints. What I question is whether those
priorities are at all relevant today, when billions of
smartphone users are regularly not using ASCII, and these tech
companies are the largest private organizations on the planet,
ie they have the resources to design a new transfer format. I
see basically no relevance for the streaming requirement today,
as I noted in this forum years ago, but I can see why it might
have been considered important in the early '90s, before
packet-based networking protocols had won.
I think a header-based scheme would be _much_ better today, and
I know Dmitry knows that, because I have discussed with him
privately over email my plan to prototype a format
like that in D. Even if UTF-8 is already fairly widespread,
something like that could be useful as a better intermediate
format for string processing, and maybe someday could replace