Michel Fortin wrote:
On 2009-04-27 07:04:06 -0400, Frank Benoit <[email protected]> said:

M$ and Java have chosen UTF-16 as their default Unicode character
encoding. I am sure the decision was not made without good reasoning.

What are their arguments?
Why does D promote UTF-8 as the default?

E.g.
Exception.msg
Object.toString()
new std.stream.File( char[] )

The argument at the time was that they were going to work directly with Unicode code points, thus simplifying things. Then Unicode was extended to cover even more characters, and at some point 16 bits became insufficient: surrogate pairs were added so that the higher code points could still be represented, turning the 16-bit encoding into a variable-width encoding now known as UTF-16.

So it turns out that those who made that choice in the early years of Unicode made it for reasons that no longer exist. Today, variable-width UTF-16 makes calculating a string's length and doing random access in a string just as hard as in UTF-8. In practice, many frameworks just ignore the problem and happily count each code unit of a surrogate pair as a separate character, as they always did, but that behaviour isn't exactly correct.
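
Here is a small D sketch of the problem (not from the original post, just an illustration): a character outside the Basic Multilingual Plane has a different length in each encoding, and none of those lengths is a "character count" except the UTF-32 one.

    import std.stdio;

    void main()
    {
        // U+1D11E (MUSICAL SYMBOL G CLEF) lies outside the BMP,
        // so UTF-16 needs a surrogate pair to encode it.
        string  s8  = "\U0001D11E";   // UTF-8
        wstring s16 = "\U0001D11E"w;  // UTF-16
        dstring s32 = "\U0001D11E"d;  // UTF-32

        writeln(s8.length);   // 4 code units
        writeln(s16.length);  // 2 code units -- the surrogate pair
        writeln(s32.length);  // 1 code unit == 1 code point
    }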

To get the benefit those framework/language designers thought they'd get at the time, we'd have to go with UTF-32, but then storing strings becomes immensely wasteful. And that's not counting that most data exchange these days has standardized on UTF-8; you rarely encounter UTF-16 in the wild (and when you do, you have to take care about the UTF-16 LE and BE variants), and UTF-32 is rarer still. Nor that perhaps, one day, Unicode will grow again and fall outside of its 32-bit range... although that may have to wait until we learn a few extraterrestrial languages. :-)
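
For a rough sense of the waste, a quick back-of-the-envelope check in D (sizes in bytes, ASCII-heavy text):

    import std.stdio;

    void main()
    {
        // For ASCII-heavy text, UTF-32 spends four bytes on every
        // character that UTF-8 stores in a single byte.
        string  s8  = "The quick brown fox";
        dstring s32 = "The quick brown fox"d;

        writeln(s8.length  * char.sizeof);   // 19 bytes
        writeln(s32.length * dchar.sizeof);  // 76 bytes
    }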

So the D solution, which is to use UTF-8 everywhere while still supporting string operations on UTF-16 and UTF-32, looks very good to me. What I actually do is use UTF-8 everywhere, and sometimes, when I need to easily manipulate characters, I use UTF-32. I use UTF-16 for dealing with APIs that expect it, but not for much else.
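
Something along these lines (a rough sketch using Phobos's std.utf conversion helpers; a foreach with a dchar loop variable decodes the UTF-8 string one code point at a time):

    import std.stdio;
    import std.utf;

    void main()
    {
        string s = "naïve";       // stored as UTF-8

        // Decode on the fly, one code point per iteration.
        foreach (dchar c; s)
            write(c, ' ');
        writeln();

        // Convert only when needed.
        auto w = toUTF16(s);      // e.g. for APIs expecting UTF-16
        auto d = toUTF32(s);      // for easy per-character manipulation
        writeln(d.length, " code points, ", w.length, " UTF-16 code units");
    }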

Well put. I distinctly remember the hubbub around Java's UTF-16 support that was going to solve all of strings' problems, followed by the embarrassed silence upon the introduction of UTF-32.

Andrei
