Re: Unicode

Frank Atanassow Tue, 16 May 2000 03:23:14 -0700
George Russell writes:
 > Marcin 'Qrczak' Kowalczyk wrote:
 > > As for the language standard: I hope that Char will be allowed or
 > > required to have >=30 bits instead of current 16; but never more than
 > > Int, to be able to use ord and chr safely.
 > Er does it have to?  The Java Virtual Machine implements Unicode with
 > 16 bits.  (OK, so I suppose that means it can't cope with Korean or Chinese.)

Just to set the record straight:

Many CJK (Chinese-Japanese-Korean) characters are encodable in 16 bits. I am
not so familiar with the Chinese or Korean situations, but in Japan there is a
nationally standardized subset of about 2000 characters called the Jyouyou
("often-used") kanji, which newspapers and most printed books are mostly
supposed to respect. These are all strictly contained in the 16-bit space. One
only needs the additional 16 bits for foreign characters (say, Chinese), older
literary works and suchlike. Even then, Japanese has two phonetic alphabets
as well, and you can usually substitute phonetic characters in place of
non-Jyouyou kanji---in fact, since these kanji are considered difficult, one
often _does_ supplement the ideographic representation with a phonetic one.
Of course, using only phonetic characters in such cases would
look unprofessional in some contexts, and it forces the reader to guess at
which word was meant...

For Korean and especially Chinese, the situation is not so pat. Korean's
phonetic alphabet is of course wholly contained within the 16-bit space, but
the Chinese, as a rule, don't use phonetic characters. Koreans rely on their
phonetic alphabet more than the Japanese, but they still tend to use (I
believe) more esoteric Chinese ideographic characters than the Japanese
do. And the Chinese have a much larger set of ideographic characters in common
use than either of the other two. I'm not sure what percentage is contained in
the 16-bit space; it's probably enough that you can communicate most
non-specialized subjects fairly comfortably, but it is safe to say that the
Chinese would be the first to demand more encoding space.

In summary, 16 bits is enough to encode most modern texts if you don't mind
fudging a bit, but for high-quality productions, historical and/or specialized
texts, CJK users will want 32 bits.
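
To make the 16-bit boundary concrete, here is a minimal Haskell sketch of how
one might check whether a text fits in 16 bits. It assumes an implementation
whose Char covers code points above 0xFFFF (which is what Marcin hopes for
above); the names fitsIn16Bits and needsMoreThan16 are made up for
illustration and are not part of any standard library.

```haskell
import Data.Char (chr, ord)

-- Hypothetical helper: True when the character's code point fits in 16 bits.
fitsIn16Bits :: Char -> Bool
fitsIn16Bits c = ord c <= 0xFFFF

-- Hypothetical helper: the characters of a string that need more than 16 bits.
needsMoreThan16 :: String -> String
needsMoreThan16 = filter (not . fitsIn16Bits)
```

With these definitions, fitsIn16Bits (chr 0x6F22) is True (0x6F22 is an
ordinary kanji), while fitsIn16Bits (chr 0x20000) is False, and
needsMoreThan16 picks out exactly the characters a 16-bit Char could not hold.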

Of course, you can always come up with specialized schemes involving stateful
encodings and/or "block-swapping" (using the Unicode private-use areas, for
example), but then, that subverts the purpose of Unicode.

-- 
Frank Atanassow, Dept. of Computer Science, Utrecht University
Padualaan 14, PO Box 80.089, 3508 TB Utrecht, Netherlands
Tel +31 (030) 253-1012, Fax +31 (030) 251-3791