On Tuesday 08 January 2002 08:58 pm, you wrote:
> Followup to:  <[EMAIL PROTECTED]>
> By author:    [EMAIL PROTECTED]
> In newsgroup: linux.utf8
> >
> > > Character Set Encoding of Tags:
> > > ===============================
> > >
> > > UTF-8 is the default encoding for tag data. Unfortunately
> > > UTF-8 muffed it for Asian languages by doing the equivalent of
> > > giving the same character codes to English, Russian, and Greek
> > > letters.
I have followed this debate for several years. I think the Japanese
experts in the Ideographic Rapporteur Group did a wonderful job on Han
unification. Even the Japanese anti-unification stalwarts admit that
the characters they find troublesome would not have been considered
difficult by anyone educated before 1950; that there are only a small
number of characters of concern, primarily U+76F4 (Mathews 1004,
Nelson 775) and characters containing it; and that the problem only
arises when multilingual plain text is displayed in an inappropriate
font. Since the author can control the font in formatted text, and the
user can control the font when viewing plain text, I don't see the
problem. (To which opponents of unification reply, "_That_'s the
problem", so we don't get anywhere.)

On the other hand, I have never heard anyone other than a
mathematician complain about the unification of Fraktur and the other
writing styles of the Latin alphabet, even though Fraktur is extremely
difficult for Americans, and even for younger Germans, to read.
Fraktur will evidently be disunified in Unicode 4.0 for use in
variable names in math, but _not_ for text.

> > It's interesting that Japanese and Chinese, which are unrelated
> > languages, are sometimes mutually understandable when written,
> > but somehow use totally different scripts.

Almost all Japanese characters came from China, although there are a
few indigenous creations, such as the character for mountain pass
(touge), and a number of simplified characters that appeared first in
Japan. Kana is, of course, a purely Japanese invention.

> Also, there would hardly have been any hurt feelings if U+0065
> (Latin), U+0391 (Greek), and U+0410 (Cyrillic) had been unified.
> It would just not have saved enough code points to bother.

Actually, it was impossible because of the source separation rule
applied to Chinese, Korean, and Japanese encodings, including Big
Five, GB2312, KSC, and JIS. (How ironic.)
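To make the round-trip argument concrete, here is a quick check (a
sketch, assuming Python's built-in "gb2312" codec) that GB2312 already
assigns three distinct code positions to Latin, Greek, and Cyrillic
capital A, which is exactly what the source separation rule had to
preserve:

```python
# Sketch (assumes Python's standard "gb2312" codec): GB2312, one of
# the source standards, already encodes Latin A, Greek Alpha, and
# Cyrillic A at distinct code positions.  Round-trip conversion
# through Unicode therefore needs three distinct code points -- the
# source separation rule in action.
for ch, name in [("\u0041", "LATIN CAPITAL LETTER A"),
                 ("\u0391", "GREEK CAPITAL LETTER ALPHA"),
                 ("\u0410", "CYRILLIC CAPITAL LETTER A")]:
    encoded = ch.encode("gb2312")
    assert encoded.decode("gb2312") == ch  # round trip survives
    print(f"U+{ord(ch):04X} {name} -> {encoded.hex()}")
```

The three byte sequences printed are all different; had the three
letters been unified into one Unicode code point, decoding GB2312 text
to Unicode and re-encoding it could not have restored the original
bytes.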
These standards include various combinations of the Latin, Greek, and
Cyrillic alphabets in separate code blocks alongside Hanzi, Zhuyin,
Hangul, and kana. So LATIN CAPITAL LETTER A, CYRILLIC CAPITAL LETTER
A, and GREEK CAPITAL LETTER ALPHA cannot be unified without breaking
round-trip conversion for these standards. On the other hand, Kurdish
Cyrillic Q has been unified with Latin Q, since there was no
pre-existing character set standard containing both at separate code
points. There is, for example, no KOI-8-K(urdish).

> As far as "English" and "Russian" are concerned, the various
> Latin-script and the various Cyrillic-script languages have been
> unified for a long, long time.

Do you mean each group separately from the other? Otherwise I can't
make sense of this.

> Things like U+212A and U+212B should never have been allowed to
> happen, on the other hand, IMNSHO.

(KELVIN SIGN and ANGSTROM SIGN.) The source separation rule, again.
We're stuck with them in the standard, but we don't have to use them,
or any of the other Compatibility Characters and Presentation Forms.
Anyway, the true worst case was the encoding of more than 11,000
Hangul syllables, none of which is required.

> -hpa

Better an imperfect but entirely usable standard than no standard.

-- 
Edward Cherlin
[EMAIL PROTECTED]
Does your Web site work?

--
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/linux-utf8/
