Towards handling CJK style variants in UTF-8 xterm

Markus Kuhn Thu, 08 Feb 2001 07:49:49 -0800
A few (primarily Japanese) contributors have repeatedly and very vocally
complained that replacing the various ISO-2022-?? coding conventions
currently in use with UTF-8 will destroy the typographic quality
currently available to them for multi-lingual CJK text processing.

This text outlines a possible solution.

I'd first like to point out that ISO 2022 was never intended to be a
font style or language tagging mechanism and therefore should not play
any role in future UTF-8 based implementations.

It does indeed seem desirable to extend xterm to handle font styles in
addition to the currently supported variants "normal" and "bold".
Stateful encoding of font styles has been common implementation practice
for a long time. For instance, I myself would be *very* interested in
seeing at least "italics" implemented as a style variant, preferably
by implementing the already existing ISO 6429 SGR control sequence
(ESC [ ... m) for it.
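
As a rough illustration of what an application would emit, here is a
minimal Python sketch that builds SGR sequences. The parameter values
(1 = bold, 3 = italicized, 23 = not italicized) are those defined by
ISO 6429; the helper names sgr() and italic() are made up for this
example, and whether a given terminal actually renders italics depends
on that terminal.

```python
ESC = "\x1b"


def sgr(*params: int) -> str:
    """Build an ISO 6429 SGR control sequence: ESC [ p1 ; p2 ... m."""
    return f"{ESC}[{';'.join(str(p) for p in params)}m"


ITALIC_ON = sgr(3)    # SGR 3: italicized
ITALIC_OFF = sgr(23)  # SGR 23: not italicized


def italic(text: str) -> str:
    """Wrap text so that only this run is marked as italic (hypothetical helper)."""
    return ITALIC_ON + text + ITALIC_OFF


if __name__ == "__main__":
    # The style is stateful: it stays in effect until switched off again.
    print("normal " + italic("italic") + " normal again")
    print(sgr(1, 3) + "bold and italic" + sgr(0))  # SGR 0 resets all attributes
```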

Similarly, xterm could also be extended to allow applications and
plaintext files to set the preferred CJK style variant. Two possible
mechanisms for that might be

  - an extension of the ISO 6429 SGR control sequence (ESC [ ... m) to
    add "Kanji" as just another style variant, like italics

  - utilization of Unicode Plane 14 language tagging information

I will conclude that the Plane 14 tagging is probably the best available
approach to be used by terminal emulators, but to get there let's look
at a broader set of options first:

There are various ways of specifying CJK font styles:

  - UCS source groups [G,T,J,K,V]
  - countries/standards bodies (ISO 3166)
  - languages (ISO 639)
  - language+country pairs
  - independent style names ("kanji", etc.)

ISO 10646-1:2000(E) defines in section 27 (page 304) the five groups of
East Asian source standards within which characters from previously
existing standards have not been unified. These five groups are
identified by the letters G, T, J, K, and V.

It is important to understand that any ISO-2022-?? encoded CJK ideograph
can be converted into a tuple consisting of a UCS code and one letter
out of G,T,J,K,V *without* any loss of information.

There is a trivial mapping from the five UCS unification groups to the
national standard bodies responsible for the source standard in each
group, and to their respective country and language:

  G -> GB       -> China    CN -> Chinese          zh_CN
  T -> TCA-CNS  -> Taiwan   TW ->  " / Taiwanese   zh_TW
  J -> JIS      -> Japan    JP -> Japanese         ja
  K -> KS       -> Korea    KR -> Korean           ko
  V -> TCVN     -> Vietnam  VN -> Vietnamese       vi
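
The table above, together with the (UCS code, source group) tuple
described earlier, can be sketched as a small lookup structure. This is
only an illustration in Python: the names SOURCE_GROUPS, TaggedIdeograph
and language_for are hypothetical, and the code point used in the
example is just an arbitrary unified ideograph.

```python
from dataclasses import dataclass

# source group -> (standards body, country code, language code),
# copied from the table above
SOURCE_GROUPS = {
    "G": ("GB", "CN", "zh_CN"),
    "T": ("TCA-CNS", "TW", "zh_TW"),
    "J": ("JIS", "JP", "ja"),
    "K": ("KS", "KR", "ko"),
    "V": ("TCVN", "VN", "vi"),
}


@dataclass(frozen=True)
class TaggedIdeograph:
    """A CJK ideograph as a (UCS code, source group) tuple."""
    ucs: int
    source: str

    def __post_init__(self):
        # Reject anything that is not one of the five unification groups.
        if self.source not in SOURCE_GROUPS:
            raise ValueError(f"unknown source group: {self.source!r}")


def language_for(ideograph: TaggedIdeograph) -> str:
    """Return the language code associated with the ideograph's source group."""
    return SOURCE_GROUPS[ideograph.source][2]


if __name__ == "__main__":
    ch = TaggedIdeograph(0x76F4, "J")  # arbitrary example code point
    print(f"U+{ch.ucs:04X} from group {ch.source} -> {language_for(ch)}")
```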

So for guaranteed ISO 2022 -> UCS -> ISO 2022 round-trip compatibility
of CJK ideographs, source standard country tagging of UCS plaintext
might seem to be the cleanest approach at first.

On the other hand, for other processing requirements (paragraph
formatting, spell checking, etc.), language tagging (with an optional
country distinguisher only where really necessary) seems far more
appropriate, and this is the route that various standards (Unicode
Plane 14, HTML, OpenType, etc.) have therefore taken.

Separate mechanisms for specifying both source standard country and
language tagging seem redundant and undesirable.

Language tagging is perhaps not perfectly suited to distinguishing CN
from TW (both countries speak/write Chinese), but it was also my
understanding that there are no critical style differences between these
two unified source standard groups. (If there are, please quote UCS
codes in which the G and T glyphs differ substantially.) On the other
hand, language tagging makes things far easier for the many countries
that share a common language (KR/KP, etc.) and the associated
typographic conventions.

Last year we had a lengthy discussion on [EMAIL PROTECTED] about whether
a country code or a language code should go into the ADD_STYLE_NAME
field of an XLFD for style families of CJK fonts in XFree86 4.0. We
ended up using a language code, not a country code. There was one very
vocal contributor to the discussion, Nozomi Ytow <[EMAIL PROTECTED]>,
who strongly argued in favour of using a country code instead.
The general preference seemed to be for the language code, as country
codes seemed too narrowly centered on distinguishing the above five
unification groups, without taking care of other uses of text tagging
information. Adding a layer of indirection between the name of a font
style and the style names with which the text is tagged was also
considered to be an unnecessary and undesirable complication.

At the moment, xterm has neither an ESC sequence to set a language
code for the currently preferred font style nor a mechanism to record
this style with each glyph. It seems perfectly feasible either to extend
the ISO 6429 SGR control sequence for that purpose (which would have to
be done using a numeric code to remain within the ISO 6429 syntax and is
therefore highly problematic!) or to interpret Plane 14 language tags
for that purpose (which allow the use of the ISO 639-1/ISO 3166-1
alpha-code syntax that is already widely known from locale names and
HTML).

With merely the five Plane 14 language tags

  zh-cn, zh-tw, ja, ko, vi

added and interpreted, I believe that multi-lingual UTF-8 processing
software would become functionally fully equivalent to existing ISO 2022
software such as Mule without the horrendous complexity and functional
restrictions (searching, etc.) commonly associated with ISO 2022. As far
as display requirements are concerned, even fewer tags might be
sufficient, as I suspect that for instance the distinction between zh-cn
and zh-tw is not even necessary unless full ISO 2022 round-trip
compatibility is required for communication with legacy applications
(CTEXT, etc.).

http://www.unicode.org/unicode/reports/tr7/
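
To make the Plane 14 mechanism concrete, here is a minimal Python sketch
of how such a language tag is spelled in a UCS string: U+E0001 (LANGUAGE
TAG) followed by tag characters in U+E0020..U+E007E, which mirror ASCII
at an offset of 0xE0000, with U+E007F (CANCEL TAG) to end the tagged
range. The function names are hypothetical and this is not xterm code;
it only shows the encoding and how easily the tags can be stripped
again for searching.

```python
LANGUAGE_TAG = 0xE0001  # U+E0001 LANGUAGE TAG
CANCEL_TAG = 0xE007F    # U+E007F CANCEL TAG
TAG_BASE = 0xE0000      # tag characters mirror ASCII at this offset


def language_tag(code: str) -> str:
    """Spell a language code such as 'ja' or 'zh-cn' as a Plane 14 tag."""
    if not all(0x20 <= ord(c) <= 0x7E for c in code):
        raise ValueError("language code must be printable ASCII")
    return chr(LANGUAGE_TAG) + "".join(chr(TAG_BASE + ord(c)) for c in code)


def cancel_language_tag() -> str:
    """LANGUAGE TAG followed by CANCEL TAG cancels the current language tag."""
    return chr(LANGUAGE_TAG) + chr(CANCEL_TAG)


def strip_tags(text: str) -> str:
    """Remove all Plane 14 tag characters, e.g. before searching the text."""
    return "".join(c for c in text if not TAG_BASE <= ord(c) <= CANCEL_TAG)


if __name__ == "__main__":
    ideograph = "\u76f4"  # arbitrary unified ideograph used as sample text
    tagged = language_tag("ja") + ideograph + cancel_language_tag()
    print([f"U+{ord(c):04X}" for c in tagged])
    print(tagged.encode("utf-8").hex(" "))
    assert strip_tags(tagged) == ideograph
```

Because the tags are ordinary (if invisible) characters, a terminal
emulator only has to remember the most recently seen tag as part of its
current rendition state, much as it does for bold.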

More details on what to do in xterm will follow in a separate posting.

Markus

-- 
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org,  WWW: <http://www.cl.cam.ac.uk/~mgk25/>

-
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/lists/