At 11:02 AM 1/18/02 -0800, Barry Caplan wrote:
>I've always been under the impression that one of the original goals of 
>the Unicode effort was to do away with the sort of multi-width encodings we 
>are all too familiar with (EUC, JIS, SJIS, etc.). This was to be 
>accomplished by using a fixed-width encoding. In my mind, everything other 
>than that in order to increase space (but not necessarily to save 
>bandwidth) is a kluge, and a compromise, because it means code still has 
>to be aware of the details of the encoding scheme.

That was one of the original goals - however, it was to be achieved by 
'composing' a lot more characters than are composed (or even composable) 
now. Such an ideal Unicode would have had no Arabic Presentation forms 
(saves about 1K codes), no Hangul syllables (saves 11K codes), no Han 
variants (large, but the savings are harder to estimate), no polytonic Greek 
and many fewer Latin precomposed characters (ideally, it should have had 
none, in which case the savings would have been another K or so).

Under such a system, all the scripts coded today would have fit in the BMP, 
with lots of room to spare. However, the majority of all users would have 
needed systems that could handle character code sequences of variable 
length for many common tasks.

By using a variable-length encoding like UTF-16, which front-loads 
practically all the commonly used characters into the single-unit case; 
which, unlike combining character sequences, allows only two possible 
lengths (1 and 2 units), determinable from the first unit of the sequence; 
and which uses code positions that never alias between single-unit code 
points and the parts of double-unit sequences -- in other words, by doing 
all this, the fact that the majority of software *cannot* handle variable 
length sequences has become much less important than it would have been 
with "idealCode".
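To make those three properties concrete, here is a minimal sketch (not part of the original message, and the function names are mine) of how UTF-16's design pays off in code: the length of every sequence is known from its first unit, and trail units can never be mistaken for characters.

```python
# Lead (high) surrogates occupy 0xD800-0xDBFF, trail (low) surrogates
# 0xDC00-0xDFFF; every other 16-bit value is a complete character on its own.

def utf16_sequence_length(unit: int) -> int:
    """Length in 16-bit units of the sequence that starts at `unit`."""
    if 0xD800 <= unit <= 0xDBFF:      # lead surrogate: start of a pair
        return 2
    if 0xDC00 <= unit <= 0xDFFF:      # trail surrogate: never starts a
        raise ValueError(             # sequence, so there is no aliasing
            "trail surrogate cannot start a sequence")
    return 1                          # ordinary BMP character, one unit

def decode_pair(lead: int, trail: int) -> int:
    """Combine a surrogate pair into a supplementary code point."""
    return 0x10000 + ((lead - 0xD800) << 10) + (trail - 0xDC00)
```

For example, U+1D11E (musical G clef) is encoded as the pair 0xD834, 0xDD1E, and `decode_pair(0xD834, 0xDD1E)` recovers it; a scanner never has to look backward to resynchronize.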

Conversely, since some scripts require the use of character sequences even 
now, the gains of going to UTF-32 are limited. Yes, lower-level 
infrastructure is a bit easier than with UTF-16, but supporting complex 
scripts, or advanced usage of not-so-complex scripts, isn't.

Incidentally, using twice as much space for your character codes seems 
unimportant, when you look at network bandwidth and disk space, media that 
are shared with gigantic amounts of non-character data. Things begin to 
look a lot different when you are trying to process large amounts of 
character data and suddenly realize that your high-speed cache can fit 
nearly twice as many characters when you use UTF-16 as when you use UTF-32.
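The arithmetic behind that last point is simple (the 32 KiB cache size below is a hypothetical figure of mine, and the "nearly twice" assumes text that is almost entirely BMP, i.e. one 16-bit unit per character):

```python
# How many characters fit in a fixed cache budget under each encoding,
# assuming mostly-BMP text (1 unit per character in UTF-16).
CACHE_BYTES = 32 * 1024          # hypothetical L1 data cache

utf16_chars = CACHE_BYTES // 2   # 2 bytes per BMP character
utf32_chars = CACHE_BYTES // 4   # 4 bytes per character, always

print(utf16_chars, utf32_chars)  # 16384 vs. 8192
```

Only the occasional surrogate pair eats into the UTF-16 side of that ratio, which is why "nearly twice" is the honest phrasing.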

A./
