Re: filename and normalization (was gcc identifiers)

Jungshik Shin Thu, 05 Dec 2002 05:09:27 -0800

On Wed, 4 Dec 2002, seer26 wrote:

> > is to insist that  11,172 modern precomposed syllables be encoded
> > in Unicode/10646. Next biggest blunder they made is to encode tens
> > of totally unnecessary cluster-Jamos when only 17+11+17+ a few more
> > would have been more than sufficient. Next stupid thing they did is
....

> Would Chinese be in a similar situation if the radicals were
> combining characters, and any combination of them could in theory be
> a valid character?

  Possibly. However, radicals are only a small subset of the 'components'
used in Chinese characters. You would need many more 'components' than
the radicals listed in any Chinese character dictionary.

> In practice, of course, a normal person would use
> far fewer than 10,000 distinct characters.

  Do you think anybody wants a character set standard (like Unicode)
to specify which sequences of Latin/Greek/Cyrillic letters are allowed?
Imagine that you can use 'ab, eb, ob, se, ce' but cannot use
'sce, gh, ph'. That's what encoding a fixed set of precomposed
syllables does to the Korean alphabet.
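  As a rough illustration of what "a fixed set" means here, the sketch
below reproduces the arithmetic behind the 11,172 figure. The index
counts (19 leading consonants, 21 vowels, 28 trailing slots) and the
base code point U+AC00 come from the Unicode Standard's Hangul syllable
composition algorithm, not from this message, and compose() is a
hypothetical helper name used only for illustration.

```python
# Counts assumed from the Unicode Standard's Hangul composition
# algorithm; they are not figures stated in this message.
L_COUNT = 19      # leading consonants (including doubled ones)
V_COUNT = 21      # vowels (including compound vowels)
T_COUNT = 28      # 27 trailing consonants plus "no trailing consonant"
S_BASE = 0xAC00   # first precomposed syllable, U+AC00

# Size of the precomposed block: every allowed (L, V, T) combination.
TOTAL = L_COUNT * V_COUNT * T_COUNT


def compose(l, v, t=0):
    """Map (leading, vowel, trailing) indices to a precomposed syllable.

    Anything outside the fixed index ranges simply has no code point
    in the precomposed block, so it is rejected here.
    """
    if not (0 <= l < L_COUNT and 0 <= v < V_COUNT and 0 <= t < T_COUNT):
        raise ValueError("index outside the modern precomposed range")
    return chr(S_BASE + (l * V_COUNT + v) * T_COUNT + t)


if __name__ == "__main__":
    print(TOTAL)
    print(compose(18, 0, 4))
```

Any letter or cluster without an index in those three tables cannot be
expressed in the block at all, which is the 'sce, gh, ph' problem above.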

> Have you ever needed a character that wasn't among the 11,172 precomposed
> ones?

  Sure! See <http://jshin.net/i18n/korean/hunmin.html>
or <http://jshin.net/i18n/uyeo.html>. The 11,172 precomposed syllables
don't include any syllables from pre-1933 orthography. Nor does the set
include modern incomplete syllables (which high school Korean teachers
need in order to teach Korean grammar). Basically, it was a very stupid
idea (and a vast waste of codespace) to enumerate the possible
combinations of alphabetic letters. Just encoding the alphabetic letters
would have been more than enough. I wish the Korean national standards
body had been half as competent as its counterpart in India. ISCII (which
ISO 10646/Unicode copied almost verbatim) did a great job of encoding
only what's absolutely necessary for Indic scripts. And that was in the
early 1990s, when no intelligent modern rendering engine or font was in
sight. They nevertheless had the foresight to see that encoding hundreds
or thousands of 'presentation forms' for each Indic script was not the
way to go, and that intelligent, advanced fonts and rendering engines
would eventually appear. They were right: nowadays Indic scripts are
pretty well supported by Pango, Uniscribe, ATSUI, and Graphite. It may
take a while longer to have OpenType fonts in the public domain for all
Indic scripts, but they're coming...
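  For anyone who wants to check the point about syllables missing from
the precomposed set, here is a minimal sketch using Python's standard
unicodedata module. The sample sequences are my own picks for
illustration: a modern syllable written as conjoining jamos, the same
syllable with the old vowel U+119E (arae-a), and a lone leading
consonant standing in for an incomplete syllable.

```python
import unicodedata

# Modern syllable as conjoining jamos: HIEUH + A + final NIEUN.
MODERN = "\u1112\u1161\u11ab"
# Same consonants with the old vowel ARAEA (U+119E), as found in
# pre-1933 orthography.
OLD = "\u1112\u119e\u11ab"
# A lone leading consonant, i.e. an incomplete syllable.
LONE = "\u1100"


def nfc(text):
    """Return the NFC form of text."""
    return unicodedata.normalize("NFC", text)


def describe(label, text):
    """Print how many code points remain after NFC normalization."""
    composed = nfc(text)
    points = " ".join(f"U+{ord(ch):04X}" for ch in composed)
    print(f"{label}: {len(text)} -> {len(composed)} code point(s): {points}")


if __name__ == "__main__":
    describe("modern", MODERN)
    describe("old", OLD)
    describe("lone", LONE)
```

Only the modern sequence collapses to a single precomposed code point
under NFC. The other two stay as jamo sequences, so any software that
assumes one syllable per code point mishandles them.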

  Jungshik Shin

--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/