On Friday, 24 May 2013 at 09:49:40 UTC, Jacob Carlborg wrote:
toUpper/lower cannot be done in place if they are to handle all of Unicode. Some characters change length when converted to or from uppercase. Examples are the German double S (ß) and some of the Turkish I variants.
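To make the length change concrete, here is a minimal sketch. It uses Python's built-in `str.upper` and `str.lower` as a stand-in for D's `toUpper`/`toLower`, on the assumption that both apply Unicode's full case mappings; the helper `utf8_len` is made up for this illustration.

```python
def utf8_len(s: str) -> int:
    """Number of bytes (UTF-8 code units) needed to encode s."""
    return len(s.encode("utf-8"))

sharp_s = "\u00df"    # ß, German sharp S
dotted_I = "\u0130"   # İ, Turkish capital I with dot above
dotless_i = "\u0131"  # ı, Turkish dotless small i

# ß uppercases to "SS": one code point becomes two.
upper_sharp_s = sharp_s.upper()

# İ lowercases to "i" plus a combining dot (U+0307): the UTF-8 form grows.
lower_dotted_I = dotted_I.lower()

# ı uppercases to plain ASCII "I": the UTF-8 form shrinks.
upper_dotless_i = dotless_i.upper()

for before, after in [(sharp_s, upper_sharp_s),
                      (dotted_I, lower_dotted_I),
                      (dotless_i, upper_dotless_i)]:
    print(len(before), "->", len(after), "code points;",
          utf8_len(before), "->", utf8_len(after), "bytes")
```

Note that the mapping of ß changes the number of code points, so a fixed-width encoding such as UTF-32 would not allow in-place conversion for it either.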

This triggered a long-standing bugbear of mine: why are we using these variable-length encodings at all? Does anybody really care about UTF-8 being "self-synchronizing," i.e., does anybody actually use that property in this day and age? Sure, it's backwards-compatible with ASCII, and the vast majority of usage is probably just ASCII, but that amounts to saying the other languages don't matter anyway. Not to mention taking the valuable 8-bit real estate for English and dumping the longer encodings on everyone else.
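For reference, "self-synchronizing" means a decoder dropped at an arbitrary byte offset can find the next character boundary without rescanning from the start, because continuation bytes always have the bit pattern `10xxxxxx`. A minimal sketch, where the function name `next_boundary` is just an illustrative choice:

```python
def next_boundary(data: bytes, pos: int) -> int:
    """Return the first index >= pos that begins a code point, or len(data)."""
    # UTF-8 continuation bytes have their top two bits set to 10.
    while pos < len(data) and (data[pos] & 0xC0) == 0x80:
        pos += 1
    return pos

# "a" is 1 byte, "é" is 2 bytes, "€" is 3 bytes in UTF-8.
data = "a\u00e9\u20ac".encode("utf-8")

# Offset 2 lands in the middle of "é"; resynchronizing skips to "€".
start = next_boundary(data, 2)
print(start, data[start:].decode("utf-8"))
```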

I'd just use a single-byte header to signify the language and then put the vast majority of languages in a single-byte encoding, with the few exceptional languages that have more than 256 characters encoded in two bytes. OK, that doesn't cover multi-language strings, but that is what, 0.000001% of usage? Make your header a little longer and you could handle those as well. Yes, it wouldn't be strictly backwards-compatible with ASCII, but it would be so much easier to internationalize. Of course, there's also the monoculture we're creating; I love this UTF-8 rant by tuomov, author of one of the first tiling window managers for Linux:

http://tuomov.bitcheese.net/b/archives/2006/08/26/T20_16_06

The emperor has no clothes, what am I missing?
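A rough sketch of what the header-byte idea might look like, purely as an illustration. The header values and the `LANG_CODECS` table are invented for this example, and the existing single-byte code pages shipped with Python stand in for the per-language encodings; the two-byte case and multi-language strings are left out.

```python
# Hypothetical header byte -> existing single-byte code page.
LANG_CODECS = {
    0x01: "latin_1",    # Western European
    0x02: "iso8859_7",  # Greek
    0x03: "koi8_r",     # Russian
}

def encode_tagged(text: str, lang_id: int) -> bytes:
    """Prefix the text with a one-byte language header, one byte per character."""
    return bytes([lang_id]) + text.encode(LANG_CODECS[lang_id])

def decode_tagged(data: bytes) -> str:
    """Read the header byte, then decode the rest with the matching code page."""
    return data[1:].decode(LANG_CODECS[data[0]])

russian = "\u041f\u0440\u0438\u0432\u0435\u0442"  # "Привет"
tagged = encode_tagged(russian, 0x03)

print(len(tagged), "bytes tagged vs", len(russian.encode("utf-8")), "bytes UTF-8")
print(decode_tagged(tagged) == russian)
```

With these assumptions, a string containing characters outside the chosen code page fails to encode (Python raises `UnicodeEncodeError`), which is the multi-language limitation mentioned above.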
