On Tuesday, 31 May 2016 at 20:20:46 UTC, Marco Leise wrote:
> On Tue, 31 May 2016 16:29:33 +0000,
> Joakim <[email protected]> wrote:
>> Part of it is the complexity of written language, part of it
>> is bad technical decisions. Building the default string type
>> in D around the horrible UTF-8 encoding was a fundamental
>> mistake, both in terms of efficiency and complexity. I noted
>> this in one of my first threads in this forum, and as Andrei
>> said at the time, nobody agreed with me, with a lot of
>> hand-waving about how efficiency wasn't an issue or that UTF-8
>> arrays were fine. Fast-forward years later and exactly the
>> issues I raised are now causing pain.
> Maybe you can dig up your old post and we can look at each of
> your complaints in detail.

Not interested. I believe you were part of that thread then.
Google it if you want to read it again.

>> UTF-8 is an antiquated hack that needs to be eradicated. It
>> forces all languages other than English to be twice as long,
>> for no good reason. Have fun with that when you're downloading
>> text on a 2G connection in the developing world. It is
>> unnecessarily inefficient, which is precisely why
>> auto-decoding is a problem. It is only a matter of time till
>> UTF-8 is ditched.
> You don't download twice the data. First of all, some
> languages had two-byte encodings before UTF-8, and second,
> web content is full of HTML syntax and is gzip-compressed
> afterwards.

The vast majority of languages can be encoded in a single byte
per character, yet they are unnecessarily forced to two bytes or
more by the inefficient UTF-8/16 encodings. HTML syntax is a non
sequitur; compression helps, but it isn't as efficient as a
proper encoding.
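
To put a number on that, here is a minimal D sketch (the Thai
sample word is my own choice for illustration) comparing byte
counts for the same text in UTF-8, UTF-16, and a single-byte
codepage like TIS-620:

    import std.stdio;

    void main()
    {
        string  s8  = "ประเทศไทย"; // "Thailand", 9 Thai characters, as UTF-8
        wstring s16 = "ประเทศไทย"; // the same text, as UTF-16

        writeln(s8.length);                 // 27: three bytes per character in UTF-8
        writeln(s16.length * wchar.sizeof); // 18: two bytes per character in UTF-16
        // A single-byte encoding such as TIS-620 stores these
        // 9 characters in 9 bytes.
    }

For Thai, UTF-8 is actually three bytes per character, so the
"twice as long" figure is, if anything, understated for some
scripts.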

> Take this Thai Wikipedia entry for example:
> https://th.wikipedia.org/wiki/%E0%B8%9B%E0%B8%A3%E0%B8%B0%E0%B9%80%E0%B8%97%E0%B8%A8%E0%B9%84%E0%B8%97%E0%B8%A2
> The download of the gzipped HTML is 11% larger in UTF-8 than
> in the single-byte Thai TIS-620 encoding. And that is dwarfed
> by the size of JS + images. (I don't have the numbers, but I
> expect the effective overhead to be ~2%.)

Nobody on a 2G connection is waiting minutes to download such
massive web pages. They are mostly sending text to each other on
their favorite chat app, and waiting longer and using up more of
their mobile data quota if they're forced to use bad encodings.

> Ironically, a lot of symbols we take for granted would then
> have to be written as HTML entities using their Unicode code
> points (sic!). Among them are basics like dashes, the degree
> (°) and minute (′) signs, accents in names, the non-breaking
> space, and footnote marks (↑).

No, they just don't use HTML, opting for much superior mobile
apps instead. :)

>> D devs should lead the way in getting rid of the UTF-8
>> encoding, not bickering about how to make it more palatable.
>> I suggested a single-byte encoding for most languages, with
>> double-byte for the ones which wouldn't fit in a byte. Use
>> some kind of header or other metadata to combine strings of
>> different languages, _rather than encoding the language into
>> every character!_
> That would have put D on an island. "Some kind of header"
> would be a horrible mess to have in strings, because you have
> to account for it when concatenating strings and scan for it
> all the time to see if there is some interspersed two-byte
> encoding in the stream. That's hardly better than UTF-8. And
> yes, a huge number of websites mix scripts, and a lot of other
> text uses the extra symbols available, like ° or α, β, γ.

Let's see: a constant-time addition to a header, or decoding
every character every time I want to manipulate the string... I
wonder which is the better choice?! You would not "intersperse"
any other encodings unless you kept track of those substrings in
the header. My whole point is that such mixing of languages or
"extra symbols" is an extreme minority use case: the vast
majority of strings are in a single language.
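
For illustration only, here is a minimal D sketch of the kind of
header scheme under debate (every name here is hypothetical, not
an actual proposal's API). The language tag is stored once per
run of text, and concatenation only touches the header, with no
per-character decoding:

    // Hypothetical header-tagged string: the language/codepage is
    // recorded once per segment instead of in every character.
    struct Segment
    {
        ushort langId;            // e.g. a codepage or script id
        immutable(ubyte)[] bytes; // single- or double-byte payload
    }

    struct TaggedString
    {
        Segment[] header; // one entry per language run
    }

    // Concatenation appends or merges segments; the payload bytes
    // are copied verbatim, with no decode step.
    TaggedString concat(TaggedString a, TaggedString b)
    {
        auto result = TaggedString(a.header.dup);
        foreach (seg; b.header)
        {
            if (result.header.length
                && result.header[$ - 1].langId == seg.langId)
                result.header[$ - 1].bytes ~= seg.bytes;
            else
                result.header ~= seg;
        }
        return result;
    }

Whether that bookkeeping stays simpler than a self-describing
encoding once slicing, searching, and mixed-script text enter the
picture is exactly what the two sides here disagree about.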

>> The common string-handling use case, by far, is strings in
>> only one language, with a distant second being strings with
>> some substrings in a second language, yet here we are putting
>> the overhead into every character to allow inserting
>> characters from an arbitrary language! This is madness.
> No thanks, madness was when we couldn't reliably open text
> files because the encoding was stored nowhere, and when you
> had to compile programs for each of a dozen codepages so that
> localized text would be rendered correctly. And your retro
> codepage system won't convince the world to drop Unicode
> either.

Unicode _is_ a retro codepage system; they merely standardized a
bunch of the most popular codepages. So that's not going away no
matter what system you use. :)
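
That claim is easy to check with a minimal D sketch (the function
name is mine): the Unicode Thai block is essentially the old
TIS-620 codepage shifted by a constant offset, just as
U+0000..U+00FF is ISO-8859-1 verbatim:

    // The Unicode Thai block (U+0E01..U+0E5B) mirrors TIS-620's
    // layout: each assigned TIS-620 byte maps to its code point
    // by a fixed offset of 0x0D60 (ignoring the codepage's
    // unassigned gaps).
    dchar tis620ToUnicode(ubyte b)
    {
        assert(b >= 0xA1, "not a TIS-620 Thai character");
        return cast(dchar)(b + 0x0D60);
    }

    unittest
    {
        assert(tis620ToUnicode(0xA1) == '\u0E01'); // THAI CHARACTER KO KAI
    }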

>> Yes, the complexity of diacritics and combining characters
>> will remain, but that is complexity that is inherent to the
>> variety of written language. UTF-8 is not: it is just a bad
>> technical decision, likely chosen for ASCII compatibility and
>> some misguided notion that being able to combine arbitrary
>> language strings with no other metadata was worthwhile. It is
>> not.
> The web proves you wrong. Scripts do get mixed often, be it
> on Wikipedia, on a foreign-language learning site, or in
> mathematical symbols.

Those are some of the least-trafficked parts of the web, which
itself is dying off as the developing world comes online through
mobile apps, not the bloated web stack.

Anyway, I'm not interested in rehashing this dumb argument again.
The UTF-8/16 encodings are a horrible mess, and D made a big
mistake by baking them in.