Why UTF-8/16 character encodings?

Joakim Fri, 24 May 2013 10:10:26 -0700

On Friday, 24 May 2013 at 09:49:40 UTC, Jacob Carlborg wrote:

toUpper/lower cannot be made in place if it should handle all Unicode. Some characters will change their length when convert to/from uppercase. Examples of these are the German double S and some Turkish I.
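A quick way to see the length change being described, as a minimal sketch in Python rather than D (the `lengths` helper is a made-up name for illustration, and the results rely on Python's built-in full Unicode case mapping):

```python
def lengths(s):
    """Return (code points, UTF-8 bytes, UTF-16 code units) for s."""
    return (len(s), len(s.encode("utf-8")), len(s.encode("utf-16-le")) // 2)

# German sharp s, Turkish dotless i, Turkish capital I with dot above.
cases = [
    ("\u00df", "\u00df".upper()),  # sharp s -> "SS"
    ("\u0131", "\u0131".upper()),  # dotless i -> "I"
    ("\u0130", "\u0130".lower()),  # dotted capital I -> "i" + combining dot
]

for original, converted in cases:
    print(repr(original), lengths(original), "->", repr(converted), lengths(converted))
```

In each case at least one of the three lengths differs before and after conversion, so a buffer sized for the input cannot always hold the output.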

This triggered a long-standing bugbear of mine: why are we using these variable-length encodings at all? Does anybody really care about UTF-8 being "self-synchronizing," ie does anybody actually use that in this day and age? Sure, it's backwards-compatible with ASCII and the vast majority of usage is probably just ASCII, but that means the other languages don't matter anyway. Not to mention taking the valuable 8-bit real estate for English and dumping the longer encodings on everyone else.
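For reference, "self-synchronizing" means a decoder dropped at an arbitrary byte offset can find the next character boundary by itself, because continuation bytes always look like 10xxxxxx. A rough sketch in Python (the `resync` helper is hypothetical, not a library function):

```python
def resync(buf, pos):
    """Advance pos past any UTF-8 continuation bytes to the next boundary."""
    # Continuation bytes have the top two bits set to 10.
    while pos < len(buf) and (buf[pos] & 0xC0) == 0x80:
        pos += 1
    return pos

data = "na\u00efve caf\u00e9".encode("utf-8")

# Offset 3 lands in the middle of the two-byte sequence for U+00EF.
start = resync(data, 3)
print(start, data[start:].decode("utf-8"))
```

Whether that property is worth the variable-length cost is exactly the question being asked here.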

I'd just use a single-byte header to signify the language and then put the vast majority of languages in a single-byte encoding, with the few exceptional languages with more than 256 characters encoded in two bytes. OK, that doesn't cover multi-language strings, but that is what, .000001% of usage? Make your header a little longer and you could handle those also. Yes, it wouldn't be strictly backwards-compatible with ASCII, but it would be so much easier to internationalize. Of course, there's also the monoculture we're creating; love this UTF-8 rant by tuomov, author of one of the first tiling window managers for Linux:


http://tuomov.bitcheese.net/b/archives/2006/08/26/T20_16_06
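To make the header-byte idea above concrete, here is a rough sketch in Python. Everything in it is an assumption for illustration: the tag values, the `CODECS` table, and the choice of existing legacy single-byte codecs as stand-ins for the per-language encodings.

```python
# Hypothetical tag -> single-byte codec table; the tag numbers are invented.
CODECS = {
    0x01: "latin-1",    # Western European
    0x02: "iso8859_7",  # Greek
    0x03: "koi8_r",     # Russian
}

def encode_tagged(text, tag):
    """Prefix a one-byte language tag, then one byte per character."""
    return bytes([tag]) + text.encode(CODECS[tag])

def decode_tagged(buf):
    """Read the tag byte, then decode the rest with that tag's codec."""
    return buf[1:].decode(CODECS[buf[0]])

greek = "\u0395\u03bb\u03bb\u03b7\u03bd\u03b9\u03ba\u03ac"
tagged = encode_tagged(greek, 0x02)
print(len(tagged), len(greek.encode("utf-8")))

# A string mixing Greek and Cyrillic does not fit a single tag.
try:
    encode_tagged(greek + " \u043f\u0440\u0438\u0432\u0435\u0442", 0x02)
    mixed_ok = True
except UnicodeEncodeError:
    mixed_ok = False
```

The single-language case comes out shorter than UTF-8 and every character is a fixed width, while the mixed string fails, which is the multi-language gap conceded above.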

The emperor has no clothes, what am I missing?
