RE: filename and normalization (was gcc identifiers)

Jungshik Shin Wed, 04 Dec 2002 15:57:00 -0800

On Wed, 4 Dec 2002, Maiorana, Jason wrote:

> >> For that reason, I don't like form D at all.  I wonder how much space
> >> it would take to represent every possible Jamo-combination, then just
> >> do away with combining characters altogether...
> >  No way!!  The biggest blunder ever made by Korean nat'l standard body
> >is to insist that  11,172 modern precomposed syllables be encoded
> >in Unicode/10646. Next biggest blunder they made is to encode tens
......
> >available in 20.1 bit coded character set which is ISO 10646/Unicode.
>
> Wow, ok, I guess that idea won't work for Korean.
> Also, since glyph swapping has to be done for merely adjacent
> characters,
> doing it for combining ones must be a relatively minor concern.
>
> Out of curiosity, how many of those Korean letters are actually
> made use of by the language? 1.5 million sounds higher than any
> number of phonemes that a human can produce....

   Needless to say, modern Korean speakers can pronounce only a very
small fraction of them, and chances are that the number will decrease
as time goes by because, as in most other languages, speakers are on
the winning side of the battle between listeners and speakers.  You
have to understand that Korean Hangul is alphabetic, and the number of
possible syllables that can be made out of a finite set of alphabetic
letters is infinite, whether the script is Latin, Greek, Cyrillic,
Indic or Korean.


> (what if the cluster jamos were dropped?)

   It doesn't make any difference at all. Cluster Jamos can be
represented just as well by a sequence of basic Jamos.  Please note
that the most general form of a Hangul sequence is given as

   L+V+T*M?

where L, V, T, and M denote leading consonant, vowel, trailing
consonant and combining mark (for Hangul, most likely one of the
two tone marks), and '+', '*', '?' have their usual meanings
in regular expressions.
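
To make the pattern concrete, here is a rough Python sketch of it.
The code point ranges are my assumption: I take the whole conjoining
Jamo block (U+1100..U+11FF) split into L, V and T, with U+302E and
U+302F as the two tone marks. The names are made up for illustration.

```python
import re

# Assumed ranges within the Hangul Jamo block (U+1100..U+11FF).
L = "\u1100-\u115F"   # leading consonants (choseong)
V = "\u1160-\u11A7"   # vowels (jungseong)
T = "\u11A8-\u11FF"   # trailing consonants (jongseong)
M = "\u302E\u302F"    # the two Hangul tone marks

# L+V+T*M? written as a regular expression over those classes.
JAMO_SYLLABLE = re.compile(f"[{L}]+[{V}]+[{T}]*[{M}]?")


def is_jamo_syllable(text):
    """Return True if text is exactly one L+V+T*M? sequence."""
    return JAMO_SYLLABLE.fullmatch(text) is not None
```

With this, a cluster written as two basic leading consonants plus a
vowel (e.g. U+1107 U+1109 U+1161) matches just like a plain L V T
sequence does, while a precomposed syllable such as U+D55C does not,
because it is not a Jamo sequence at all.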

That's why I wrote that cluster Jamos shouldn't have been encoded at all.
The same is true of all those 11,172 precomposed syllables. For Korean
Hangul, all we need is a few dozen basic Jamos. I feel 'guilty'
(although I wasn't involved in any way in forcing them through)
that Korean Hangul took about a fifth of the BMP codespace when about
a two-hundredth of that would have been enough.
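
For the arithmetic behind the 11,172 figure, here is a small Python
sketch of the standard arithmetic decomposition of a precomposed
syllable into Jamos. The constants are the ones from the Unicode
Hangul syllable algorithm; the function name is mine.

```python
import unicodedata

S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28   # T_COUNT includes "no trailing consonant"
S_COUNT = L_COUNT * V_COUNT * T_COUNT    # 19 * 21 * 28 = 11172


def decompose(syllable):
    """Split one precomposed Hangul syllable into its L, V and optional T."""
    index = ord(syllable) - S_BASE
    if not 0 <= index < S_COUNT:
        raise ValueError("not a precomposed Hangul syllable")
    l, rest = divmod(index, V_COUNT * T_COUNT)
    v, t = divmod(rest, T_COUNT)
    jamos = chr(L_BASE + l) + chr(V_BASE + v)
    if t:  # t == 0 means there is no trailing consonant
        jamos += chr(T_BASE + t)
    return jamos


# The result should agree with what normalization form D produces.
assert decompose("\uD55C") == unicodedata.normalize("NFD", "\uD55C")
```

S_COUNT / 0x10000 comes out at roughly 17% of the BMP, whereas the
19 + 21 + 27 Jamos that these syllables are built from would fit in
67 code points.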

> Are we heading for a long-run scenario, where Form-D becomes canonical,
> and all the old pre-composed codepoints are deprecated? NF-C seems
> to be getting more and more entrenched from what I can tell...

  Well, from the very beginning, the UTC didn't want to have precomposed
forms in Unicode. Precomposed characters are there not because the UTC
wanted to encode them but because it had to maintain 'compatibility' with
legacy coded character sets in which they're encoded as separate entities.
If they had been able to start afresh without any concern for
legacy character sets, there would have been NO precomposed
characters that can be represented by sequences of base characters
and combining characters.
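
As far as filenames go, the practical consequence is that the
precomposed and decomposed spellings have to be normalized to one
form before they are compared. A minimal Python sketch, where the
helper name and the choice of NFC as the default are only
assumptions for illustration:

```python
import unicodedata


def same_name(a, b, form="NFC"):
    """Compare two names after normalizing both to the same form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


precomposed = "caf\u00e9"    # 'e with acute' as one code point
decomposed = "cafe\u0301"    # 'e' followed by a combining acute accent
```

Here precomposed == decomposed is False as raw strings, but
same_name(precomposed, decomposed) is True under either NFC or NFD.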

  Jungshik Shin

--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/