On Tuesday, 31 May 2016 at 16:29:33 UTC, Joakim wrote:
> UTF-8 is an antiquated hack that needs to be eradicated. It forces all other languages than English to be twice as long, for no good reason, have fun with that when you're downloading text on a 2G connection in the developing world.

I assume you're talking about the web here. In that case, plain text makes up only a minor part of the total traffic; the majority is images (binary data), JavaScript and stylesheets (almost pure ASCII), and HTML markup (ditto). It's just not significant, even before taking compression into account, which is ubiquitous.

> It is unnecessarily inefficient, which is precisely why auto-decoding is a problem.

No, inefficiency is the least of the problems with auto-decoding.
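To pin down the term for anyone who hasn't followed the earlier threads: Phobos's range primitives present a string, which is physically an array of UTF-8 code units, as a range of dchar and decode it on the fly. A minimal illustration, nothing more than that:

import std.range.primitives : front, walkLength;
import std.stdio : writeln;

void main()
{
    string s = "héllo";      // stored as UTF-8 code units: immutable(char)[]
    writeln(s.length);       // 6 - code units, no decoding involved
    writeln(s.walkLength);   // 5 - range primitives decode to code points
    writeln(s.front);        // 'h', typed as dchar, decoded on the fly
    foreach (dchar c; s)     // each step decodes one code point
        writeln(c);
}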

> It is only a matter of time till UTF-8 is ditched.

This is ridiculous, even if your other claims were true.


> D devs should lead the way in getting rid of the UTF-8 encoding, not bickering about how to make it more palatable. I suggested a single-byte encoding for most languages, with double-byte for the ones which wouldn't fit in a byte. Use some kind of header or other metadata to combine strings of different languages, _rather than encoding the language into every character!_

I think I remember that post, and - sorry to be so blunt - it was one of the worst things I've ever seen proposed regarding text encoding.


> The common string-handling use case, by far, is strings with only one language, with a distant second some substrings in a second language, yet here we are putting the overhead into every character to allow inserting characters from an arbitrary language! This is madness.

No. The common string-handling use case is code that is unaware which script (not language, btw) your text is in.
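To make that concrete, here is a throwaway sketch (using the stock Phobos tools std.uni.byGrapheme and walkLength): the routine neither knows nor cares which script it is handed, it only decides which unit to count by.

import std.range.primitives : walkLength;
import std.uni : byGrapheme;
import std.stdio : writefln;

// Generic code: it has no idea which script (or mix of scripts) it receives.
void report(string s)
{
    writefln("%s -> %s code units, %s code points, %s graphemes",
             s, s.length, s.walkLength, s.byGrapheme.walkLength);
}

void main()
{
    report("hello");           // ASCII
    report("привет");          // Cyrillic
    report("こんにちは");       // Japanese
    report("he\u0301llo");     // 'e' + combining acute: 1 grapheme, 2 code points
}

The same function runs unchanged on all of them; nothing in it is per-language.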
