On Tuesday, 31 May 2016 at 20:20:46 UTC, Marco Leise wrote:
> On Tue, 31 May 2016 16:29:33 +0000,
> Joakim <[email protected]> wrote:
>> Part of it is the complexity of written language, part of it
>> is bad technical decisions. Building the default string type
>> in D around the horrible UTF-8 encoding was a fundamental
>> mistake, both in terms of efficiency and complexity. I noted
>> this in one of my first threads in this forum, and as Andrei
>> said at the time, nobody agreed with me, with a lot of
>> hand-waving about how efficiency wasn't an issue or that UTF-8
>> arrays were fine. Fast-forward years later and exactly the
>> issues I raised are now causing pain.
> Maybe you can dig up your old post and we can look at each of
> your complaints in detail.

Not interested. I believe you were part of that thread then.
Google it if you want to read it again.

>> UTF-8 is an antiquated hack that needs to be eradicated. It
>> forces all languages other than English to be twice as long,
>> for no good reason. Have fun with that when you're downloading
>> text on a 2G connection in the developing world. It is
>> unnecessarily inefficient, which is precisely why
>> auto-decoding is a problem. It is only a matter of time till
>> UTF-8 is ditched.
> You don't download twice the data. First of all, some
> languages had two-byte encodings before UTF-8, and second,
> web content is full of HTML syntax and is gzip-compressed
> afterwards.

The vast majority of languages can be encoded in a single byte
per character, yet they are unnecessarily forced to two bytes or
more by the inefficient UTF-8/16 encodings. HTML syntax is a non
sequitur; compression helps, but it isn't as efficient as a
proper encoding.
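
To put a number on that, here is a minimal D sketch (the Thai
sample word is my own choice for illustration) comparing byte
counts for the same text in UTF-8, UTF-16, and a single-byte
codepage like TIS-620:

    import std.stdio;

    void main()
    {
        string  s8  = "ประเทศไทย"; // "Thailand", 9 Thai characters, as UTF-8
        wstring s16 = "ประเทศไทย"; // the same text, as UTF-16

        writeln(s8.length);                 // 27: three bytes per character in UTF-8
        writeln(s16.length * wchar.sizeof); // 18: two bytes per character in UTF-16
        // A single-byte encoding such as TIS-620 stores these
        // 9 characters in 9 bytes.
    }

For Thai, UTF-8 is actually three bytes per character, so the
"twice as long" figure is, if anything, understated for some
scripts.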

> Take this Thai Wikipedia entry for example:
> https://th.wikipedia.org/wiki/%E0%B8%9B%E0%B8%A3%E0%B8%B0%E0%B9%80%E0%B8%97%E0%B8%A8%E0%B9%84%E0%B8%97%E0%B8%A2
> The download of the gzipped HTML is 11% larger in UTF-8 than
> in the single-byte Thai TIS-620 encoding. And that is dwarfed
> by the size of JS + images. (I don't have the numbers, but I
> expect the effective overhead to be ~2%.)

Nobody on a 2G connection is waiting minutes to download such
massive web pages. They are mostly sending text to each other on
their favorite chat app, and waiting longer and using up more of
their mobile data quota if they're forced to use bad encodings.

> Ironically, a lot of symbols we take for granted would then
> have to be written as HTML entities using their Unicode code
> points (sic!). Among them are basics like dashes, the degree
> (°) and minute (′) signs, accents in names, the non-breaking
> space, and footnote marks (↑).

No, they just don't use HTML, opting for much superior mobile
apps instead. :)

>> D devs should lead the way in getting rid of the UTF-8
>> encoding, not bickering about how to make it more palatable.
>> I suggested a single-byte encoding for most languages, with
>> double-byte for the ones which wouldn't fit in a byte. Use
>> some kind of header or other metadata to combine strings of
>> different languages, _rather than encoding the language into
>> every character!_
> That would have put D on an island. "Some kind of header"
> would be a horrible mess to have in strings, because you have
> to account for it when concatenating strings and scan for it
> all the time to see if there is some interspersed two-byte
> encoding in the stream. That's hardly better than UTF-8. And
> yes, a huge number of websites mix scripts, and a lot of other
> text uses the extra symbols available, like ° or α, β, γ.

Let's see: a constant-time addition to a header, or decoding
every character every time I want to manipulate the string... I
wonder which is the better choice?! You would not "intersperse"
any other encodings unless you kept track of those substrings in
the header. My whole point is that such mixing of languages or
"extra symbols" is an extreme minority use case: the vast
majority of strings are in a single language.
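
For illustration only, here is a minimal D sketch of the kind of
header scheme under debate (every name here is hypothetical, not
an actual proposal's API). The language tag is stored once per
run of text, and concatenation only touches the header, with no
per-character decoding:

    // Hypothetical header-tagged string: the language/codepage is
    // recorded once per segment instead of in every character.
    struct Segment
    {
        ushort langId;            // e.g. a codepage or script id
        immutable(ubyte)[] bytes; // single- or double-byte payload
    }

    struct TaggedString
    {
        Segment[] header; // one entry per language run
    }

    // Concatenation appends or merges segments; the payload bytes
    // are copied verbatim, with no decode step.
    TaggedString concat(TaggedString a, TaggedString b)
    {
        auto result = TaggedString(a.header.dup);
        foreach (seg; b.header)
        {
            if (result.header.length
                && result.header[$ - 1].langId == seg.langId)
                result.header[$ - 1].bytes ~= seg.bytes;
            else
                result.header ~= seg;
        }
        return result;
    }

Whether that bookkeeping stays simpler than a self-describing
encoding once slicing, searching, and mixed-script text enter the
picture is exactly what the two sides here disagree about.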

>> The common string-handling use case, by far, is strings in
>> only one language, with a distant second being strings with
>> some substrings in a second language, yet here we are putting
>> the overhead into every character to allow inserting
>> characters from an arbitrary language! This is madness.
> No thanks, madness was when we couldn't reliably open text
> files because the encoding was stored nowhere, and when you
> had to compile programs for each of a dozen codepages so that
> localized text would be rendered correctly. And your retro
> codepage system won't convince the world to drop Unicode
> either.

Unicode _is_ a retro codepage system; they merely standardized a
bunch of the most popular codepages. So that's not going away no
matter what system you use. :)
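
That claim is easy to check with a minimal D sketch (the function
name is mine): the Unicode Thai block is essentially the old
TIS-620 codepage shifted by a constant offset, just as
U+0000..U+00FF is ISO-8859-1 verbatim:

    // The Unicode Thai block (U+0E01..U+0E5B) mirrors TIS-620's
    // layout: each assigned TIS-620 byte maps to its code point
    // by a fixed offset of 0x0D60 (ignoring the codepage's
    // unassigned gaps).
    dchar tis620ToUnicode(ubyte b)
    {
        assert(b >= 0xA1, "not a TIS-620 Thai character");
        return cast(dchar)(b + 0x0D60);
    }

    unittest
    {
        assert(tis620ToUnicode(0xA1) == '\u0E01'); // THAI CHARACTER KO KAI
    }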

>> Yes, the complexity of diacritics and combining characters
>> will remain, but that is complexity that is inherent to the
>> variety of written language. UTF-8 is not: it is just a bad
>> technical decision, likely chosen for ASCII compatibility and
>> some misguided notion that being able to combine arbitrary
>> language strings with no other metadata was worthwhile. It is
>> not.
> The web proves you wrong. Scripts do get mixed often, be it
> on Wikipedia, on a foreign-language learning site, or in
> mathematical symbols.

Those are some of the least-trafficked parts of the web, which
itself is dying off as the developing world comes online through
mobile apps, not the bloated web stack.

Anyway, I'm not interested in rehashing this dumb argument again.
The UTF-8/16 encodings are a horrible mess, and D made a big
mistake by baking them in.