Re: Towards handling CJK style variants in UTF-8 xterm (fwd)

Thomas Chan Thu, 08 Feb 2001 10:19:49 -0800

------- Forwarded Message
Date: Thu, 8 Feb 2001 12:30:28 -0500 (EST)
From: Thomas Chan <[EMAIL PROTECTED]>
To: [EMAIL PROTECTED]
Subject: Re: Towards handling CJK style variants in UTF-8 xterm (fwd)

Hi Markus,

It seems that nl.linux.org is using ORBS, and rejecting me as of this
morning, so I'm sending this directly to you.  Could you post it for me?

Now, to find someone who can fix this for me... and maybe resend some of
my other messages to linux-utf8.


Thomas Chan
[EMAIL PROTECTED]


---------- Forwarded message ----------
Date: Thu, 8 Feb 2001 12:22:18 -0500 (EST)
From: Thomas Chan <[EMAIL PROTECTED]>
Reply-To: Thomas Chan <[EMAIL PROTECTED]>
To: [EMAIL PROTECTED]
Subject: Re: Towards handling CJK style variants in UTF-8 xterm

On Thu, 8 Feb 2001, Markus Kuhn wrote:

> There are various ways of specifying CJK font styles:
> 
>   - UCS source groups [G,T,J,K,V]
>   - countries/standards bodies (ISO 3166)
>   - languages (ISO 639)
>   - language+country pairs
>   - independent style names ("kanji", etc.)

First, preliminaries on CJKV countries and their languages:

China (cn)              Chinese (zh)
Taiwan (tw)             Chinese (zh)
Hong Kong (hk)          Chinese (zh)
Singapore (sg)          Chinese (zh)
Japan (jp)              Japanese (ja)
South Korea (kr)        Korean (ko)
North Korea (kp)        Korean (ko)
Vietnam (vn)            Vietnamese (vi)

We should be careful in using these and other terms, as "Chinese" can
refer to a country or language, as can "Taiwanese"; "Korean" commonly
refers solely to the South Korean situation, etc.  I will use codes
explicitly below.
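
As a rough illustration of using the codes explicitly, the table above can
be written down as a lookup.  This is only a sketch: the names
COUNTRY_LANGUAGE, locale_name and countries_for are made up for this
example and do not come from any existing library.

```python
# Country (ISO 3166) -> language (ISO 639) pairs, copied from the table above.
COUNTRY_LANGUAGE = {
    "cn": "zh", "tw": "zh", "hk": "zh", "sg": "zh",
    "jp": "ja", "kr": "ko", "kp": "ko", "vn": "vi",
}


def locale_name(country):
    """Return an explicit language_COUNTRY code such as 'zh_TW'."""
    cc = country.lower()
    return f"{COUNTRY_LANGUAGE[cc]}_{cc.upper()}"


def countries_for(language):
    """List every country code in the table that uses the given language."""
    return sorted(c for c, lang in COUNTRY_LANGUAGE.items() if lang == language)
```

Asking for countries_for("ko") returns both kr and kp, which is exactly why
a bare "ko" does not say which of the two situations is meant.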


[by UCS source groups]
> ISO 10646-1:2000(E) defines in section 27 (page 304) the five groups of
> East Asian source standards within which characters from previously
> existing standards have not been unified. These five groups are
> identified by the letters G, T, J, K, and V.
> 
> It is important to understand that any ISO-2022-?? encoded CJK ideograph
> can be converted into a tuple consisting of a UCS code and one letter
> out of G,T,J,K,V *without* any loss of information.
> 
> There is a trivial mapping from the five UCS unification groups to the
> national standard bodies responsible for the source standard in each
> group, and to their respective country and language:
> 
>   G -> GB       -> China    CN -> Chinese          zh_CN
>   T -> TCA-CNS  -> Taiwan   TW ->  " / Taiwanese   zh_TW
>   J -> JIS      -> Japan    JP -> Japanese         ja
>   K -> KS       -> Korea    KR -> Korean           ko
>   V -> TCVN     -> Vietnam  VN -> Vietnamese       vi

It's not so trivial anymore.

There's an H- source now for Hong Kong SAR (HK), unlike the case of
Singapore (SG), which is combined with China (CN) as a G- source.

The K- sources include both South Korea's (KR) "KS X" standards and North
Korea's (KP) "PKS" standards.  (I am not completely sure about this
situation: Unihan.txt gives K0 and K1 as South Korean standards, and K2
and K3 as North Korean standards; however, there also seems to be a KP-
source now for North Korean standards.[2])

These newer source designations may not appear in current publications
yet, but they are in use in the documents at the IRG website[1], e.g., the
N777 "CJK Unified Ideographs Extension B DIS For ISO 10646-2:2000" cover
note (2000.12.20).[2]

[1] http://www.cse.cuhk.edu.hk/~irg/
[2] http://www.cse.cuhk.edu.hk/~irg/N777_CJK_B_CoverNote.pdf
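
To make the "UCS code plus source letter" tuple concrete, here is a minimal
sketch of the source groups as described above, with the H- source added.
The names SOURCE_COUNTRIES, TaggedIdeograph and describe are hypothetical,
and putting KP under K (rather than under a separate KP- source) is an
assumption, given the uncertainty noted above.

```python
from typing import NamedTuple

# Source letter -> country codes, following the discussion above.
# Assumption: North Korea (KP) is listed under K; it may instead belong to a
# separate KP- source.
SOURCE_COUNTRIES = {
    "G": ("CN", "SG"),
    "H": ("HK",),
    "T": ("TW",),
    "J": ("JP",),
    "K": ("KR", "KP"),
    "V": ("VN",),
}


class TaggedIdeograph(NamedTuple):
    """A UCS code point paired with the source group it was taken from."""
    code_point: int
    source: str


def describe(ch):
    """Format a tagged ideograph, e.g. 'U+9AA8 (T-source: TW)'."""
    countries = "/".join(SOURCE_COUNTRIES[ch.source])
    return f"U+{ch.code_point:04X} ({ch.source}-source: {countries})"
```

A K-tagged character comes out as "KR/KP", so the source letter alone no
longer identifies one country.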
 

[by language+country pairs]
> Language tagging is perhaps not perfectly suited to distinguish CN
> versus TW (both countries speak/write Chinese), but it also was my
> understanding that there are no critical style differences between these
> two unified source standard groups. (If not, please quote UCS codes in
> which the G and T glyphs differ substantially.) Language tagging makes
> things on the other hand far easier for the many countries that share a
> common language (KR/KP, etc.) and the associated typographic
> conventions.

China (CN) and Taiwan (TW) (as well as Hong Kong (HK) and Singapore (SG))
would certainly require language tagging for "spellchecking",
grammar/usage checking, punctuation, and similar purposes.  As for glyph
differences, they do exist between China (CN) and Taiwan (TW) (the
commonly cited example is U+9AA8 'bone'), but most users do not object to
them as strongly as Japanese users do.  (Hong Kong (HK) and Singapore
(SG) tend to follow Taiwan (TW) and China (CN), respectively.)  On the
other hand, China (CN) does approve fonts[3], which might be an additional
complication.

[3] http://www.dynalab.com.hk/font/Bitmapfont.htm ; see "Bitmap font
certificate" link.

South Korea (KR) and North Korea (KP) would require language tagging for
different spelling conventions, segmentation of orthographic words,
sorting, etc.  See p. 442 of Ken Lunde's _CJKV Information Processing_ for
sorting differences.  Other differences can be found in various linguistic
descriptions of Korean writing.  (Information on North Korean (KP)
practice isn't easy to come by, however.)

 
> With merely the five Plane 14 language tags
> 
>   zh-cn, zh-tw, ja, ko, vi
> 
> added and interpreted, I believe that multi-lingual UTF-8 processing
> software would become functionally fully equivalent to existing ISO 2022

Those five are the most common implementation today; I think they are
sufficient for *today*.
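
For reference, a Plane 14 language tag is U+E0001 LANGUAGE TAG followed by
tag characters that mirror printable ASCII at an offset of U+E0000.  The
sketch below shows how a tag such as "zh-tw" would be spelled out in that
form; the function names are made up for this example, and no validation
of the tag against ISO 639/3166 is attempted.

```python
LANGUAGE_TAG = 0xE0001  # U+E0001 LANGUAGE TAG
TAG_BASE = 0xE0000      # tag characters are ASCII 0x20..0x7E plus this offset


def encode_language_tag(tag):
    """Return the Plane 14 character sequence for an ASCII tag like 'zh-tw'."""
    if not all(0x20 <= ord(c) <= 0x7E for c in tag):
        raise ValueError("language tags must be printable ASCII")
    return chr(LANGUAGE_TAG) + "".join(chr(TAG_BASE + ord(c)) for c in tag)


def decode_language_tag(seq):
    """Inverse of encode_language_tag; expects a leading U+E0001."""
    if not seq or ord(seq[0]) != LANGUAGE_TAG:
        raise ValueError("sequence does not start with U+E0001 LANGUAGE TAG")
    return "".join(chr(ord(c) - TAG_BASE) for c in seq[1:])
```

A renderer that interprets the tags would switch glyph style on seeing such
a sequence; one that does not would simply ignore these characters.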

Hong Kong (zh_HK) can usually follow most of Taiwan's (zh_TW) settings,
while Singapore (zh_SG) can usually follow most of China's (zh_CN)
settings.
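
The "can usually follow" rule above amounts to a fallback chain.  A minimal
sketch, assuming the set of styles an implementation actually ships is
known; FALLBACK and resolve_style are hypothetical names, not part of xterm
or any existing library.

```python
# Which locale to borrow settings from when none exist for the one requested.
FALLBACK = {
    "zh_HK": "zh_TW",
    "zh_SG": "zh_CN",
}


def resolve_style(locale, available):
    """Return the first locale in the fallback chain that is available.

    Returns None if nothing in the chain is available.
    """
    seen = set()
    while locale is not None and locale not in seen:
        if locale in available:
            return locale
        seen.add(locale)  # guard against accidental cycles in FALLBACK
        locale = FALLBACK.get(locale)
    return None
```

With only the five common styles installed, a request for zh_HK resolves to
zh_TW and a request for zh_SG resolves to zh_CN.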

It still remains to be seen how to handle Korean (ko) in the future, as
North Korea (kp) is a latecomer, and most existing software has
been geared towards South Korean usage (ko_KR).


Thomas Chan
[EMAIL PROTECTED]

------- End of Forwarded Message


-
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/lists/