At 00/12/26 09:52 +0100, Patrik Fältström wrote:

>At 16.19 -0800 00-12-25, D. J. Bernstein wrote:
>> > Also, there is a question whether UTF-8 is really what we should use.
>>
>>Many systems use UTF-8 internally. It takes less work for them to read
>>and write UTF-8 than for them to handle text in other character sets.
>
>UTF-8 is not a character set. It is an encoding of the Unicode/10646
>character set.

Please stop using the term 'character set'. The MIME documents have made that term unusable.

> I was not questioning Unicode/10646, but whether the encoding is right. I
> personally feel an ACE encoding is easier for the short term, and a 32-bit
> solution for the long term (but with a dictionary approach).

Why a 32-bit solution? Both Unicode and ISO 10646 are now limited to exactly 17*(256**2) codepoints, roughly a million, or somewhat below 21 bits. For ISO 10646 that change is rather recent, so the fact may not yet have propagated everywhere.

>I don't feel a system based on a weird encoding such as UTF-8, where we
>"penalize" some characters in the character set, is the right way to go
>if we are to find _THE_RIGHT_ solution.

Well, to use an analogy: maybe we should all become as poor as possible, to make sure there is no injustice between richer and poorer people anymore?

Seriously, I think that 32 bits is overengineering that throws resources out the window. ACE is overengineering by bit-fiddlers; the sheer number of different ACE proposals is a good measure of the arbitrariness of the ACE approach.

Also, the research I know about typical Internet traffic (i.e. Web pages) shows that UTF-8 is not less efficient than UTF-16, even for e.g. Japanese. Even if we make everything multilingual, there is still a lot of ASCII or binary protocol overhead. And going to 32 bits for the occasional Egyptian hieroglyph or Klingon character that may benefit from them, when it will be encoded in 5 or 10 years, isn't really that attractive.
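To put rough numbers on the two claims above — the size of the post-2000 codepoint space, and the byte cost of UTF-8 versus UTF-16 for a markup-heavy Japanese page — here is a small Python sketch (the sample HTML fragment is my own, purely illustrative):

```python
# The Unicode/ISO 10646 codepoint space: 17 planes of 256*256 codepoints.
planes = 17
codepoints = planes * 256 ** 2
print(codepoints)               # 1114112 -- roughly a million
print(codepoints.bit_length())  # 21 -- just over 2**20, so "somewhat below 21 bits"

# Byte cost of a markup-heavy Japanese page fragment.
# ASCII markup costs 1 byte/char in UTF-8 but 2 in UTF-16;
# kana/kanji cost 3 bytes in UTF-8 but only 2 in UTF-16.
page = '<html><head><title>日本語</title></head><body><p>こんにちは</p></body></html>'
utf8_bytes = len(page.encode('utf-8'))
utf16_bytes = len(page.encode('utf-16-le'))  # -le: payload only, no BOM
print(utf8_bytes, utf16_bytes)
```

Because the markup dominates, UTF-8 comes out smaller here despite the 3-byte kana/kanji; on pure Japanese prose with no markup the comparison tilts the other way.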
>>Quite a few programs will Just Work(tm) if IDNs are defined as UTF-8,
>>while they'll have to be upgraded if IDNs are defined any other way.
>
>This is completely false, because the big thing regarding IDN is definitely
>not what charset to use, or what encoding of Unicode (whether UTF-8 or
>ACE), but the need for the nameprep algorithm.

This is very much true on the registration side, and in theory on the end-user side. But in practice, on the end-user side it's rather irrelevant. A typical example is an 'fi' ligature, which should not be allowed but should be mapped to the two separate characters. If we help make sure nobody registers a name with an 'fi' ligature, then we are fine: nobody will take the pains to input an 'fi' ligature, rather than the letter sequence 'f', 'i', when copying a name from a billboard or business card.

And it's true that it's easier for many programs to work with UTF-8, because they already support it, than with some not yet selected (and maybe not even thought of) ACE. This in particular applies to web browsers, the place where people are looking most for multilingual domain names.

Regards, Martin.
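P.S. The 'fi' ligature mapping I described can be seen directly with Python's unicodedata module: NFKC compatibility normalization, the kind of mapping a nameprep-style profile applies at registration time, turns the single codepoint U+FB01 into the plain letters 'f', 'i'. A minimal sketch:

```python
import unicodedata

ligature = '\ufb01'  # U+FB01 LATIN SMALL LIGATURE FI, a single codepoint
mapped = unicodedata.normalize('NFKC', ligature)

print(repr(ligature), len(ligature))  # 'ﬁ' 1
print(repr(mapped), len(mapped))      # 'fi' 2 -- two plain ASCII letters
# If only the mapped form can be registered, end users never need to type
# the ligature itself; they can input 'f' then 'i' as usual.
```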
