Re: en_US.utf8/XLC_LOCALE bogus?

Jungshik Shin Mon, 29 Apr 2002 13:59:35 -0700


On Sun, 14 Apr 2002 [EMAIL PROTECTED] wrote:

> On Fri, 12 Apr 2002, Hideki Hiura wrote:
>
> > > From: Owen Taylor <[EMAIL PROTECTED]>
> > For example, here is the one used in Solaris for en_US.UTF-8 locale,
> > which I think is virtually identical with the one in X.Org's X11R6.6.x.
>
>   en_US.UTF-8 in Solaris below  includes ksc5601.1992-3 (JOHAB) and
> you wrote that it's virtually identical to the one in X.Org's
> X11R6.6.x. Does it mean that JOHAB (ksc5601.1992-3) support has been added
> to X11R6.6.x ?

  I pulled out a part of the source of X11R6.6 reference implementation
at X.org and it seems like support for JOHAB is there.

> > fs10        {
> >     charset KSC5601.1992-3:GLGR
> >     font    {
> >             primary KSC5601.1992-3:GLGR


>   When I submitted the font encoding file for ksc5601.1992-3 to include
> in XF86, Juliusz and I talked briefly about including ksc5601.1992-3
> support (beyond just being able to present truetype fonts as in
> ksc5601.1992-3 font encoding with freetype module), but we concluded (or
> rather, he suggested) that we don't have to because iso10646-1 will do
> the job, instead. However, if we follow Owen's suggestion quoted below,

  I tried several variants of XLC_LOCALE definitions for ko_KR.UTF-8,
and what I learned for certain is that XF86 4.2 doesn't support
ksc5601.1992-3 aside from being able to package TTFs in that font
encoding.

> I think we'd better have ksc5601.1992-3 support in XF86 as well.

  Now I'm less sure if it means a whole lot of code.


> Owen> And for other locales (CJK languages), we should have separate UTF-8
> Owen> XLC_LOCALE files that list the language's encoding first, followed
> Owen> by 10646-1 afterwards.

  My test results showed that this isn't going to work for
Korean unless ksc5601.1992-3 support is in place, because ksc5601.1987-0
has only 2350 Hangul syllables. There's very little point in using
ko_KR.UTF-8 if the character repertoire (as far as Hangul syllables
are concerned) is the same as in ko_KR.EUC-KR.  An alternative to
listing legacy national character sets before iso10646-1
is to use the 'add-style' field of XLFD to label CJK fonts as such
and to explicitly specify 'lang' (ja, zh_TW, zh_CN, ko) in fontsets
for various applications, desktops, etc.
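
  To make the repertoire gap concrete, here is a small sketch that
counts Hangul syllables. It assumes Python's "euc_kr" and "johab"
codecs are reasonable stand-ins for ksc5601.1987-0 and ksc5601.1992-3;
that is my assumption for illustration, not something Xlib itself uses.

```python
# Count precomposed Hangul syllables (U+AC00..U+D7A3) reachable in each
# character set. Python codecs are used as stand-ins (an assumption):
#   "euc_kr" ~ ksc5601.1987-0, "johab" ~ ksc5601.1992-3 (JOHAB).

HANGUL_FIRST, HANGUL_LAST = 0xAC00, 0xD7A3


def in_ksc5601_1987(ch):
    """True if ch has a plain 2-byte code in the 94x94 KS C 5601 set."""
    try:
        raw = ch.encode("euc_kr")
    except UnicodeEncodeError:
        return False
    # Anything longer than 2 bytes is not a direct KS C 5601 code point.
    return len(raw) == 2


def in_johab(ch):
    """True if ch can be encoded in JOHAB."""
    try:
        ch.encode("johab")
    except UnicodeEncodeError:
        return False
    return True


syllables = [chr(cp) for cp in range(HANGUL_FIRST, HANGUL_LAST + 1)]
n_unicode = len(syllables)
n_ksc = sum(in_ksc5601_1987(s) for s in syllables)
n_johab = sum(in_johab(s) for s in syllables)

print("Unicode Hangul syllables:", n_unicode)
print("in ksc5601.1987-0       :", n_ksc)
print("in ksc5601.1992-3       :", n_johab)
```

  The length check matters because a codec may have its own way of
spelling out syllables outside the 2350; only a plain 2-byte code
counts as being in the 94x94 set here.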

 A couple of changes need to be made to the XLC_LOCALE for
en_US.UTF-8 before it is 'recycled' as the XLC_LOCALE for CJK
UTF-8 locales:

  1. iso10646-1 and iso8859-1 should be followed by  cs and
     fs of the national character set of a target country/region.
     That is, in XLC_LOCALE for zh_CN.UTF-8, gb2312.1980-0
     should come BEFORE
     jisx0208.1983-0, ksc5601.1987-0 and big5. In ko_KR.UTF-8,
     ksc5601.1987-0 should come before jisx0208.1983-0,
     gb2312.1980-0 and big5. Otherwise, characters common
     to these national character sets would be 'labeled'
     in CompoundText as belonging to the *first* character
     set listed in XLC_LOCALE. This leads applications running in
     locales with legacy encodings (GB2312/EUC-CN, EUC-KR, etc.)
     to silently reject those characters when they're
     handed over in CompoundText.  For instance, U+4E00 ('one')
     is in all CJK character sets. If jisx0208.1983-0 is listed
     *before* ksc5601.1987-0 in XLC_LOCALE for ko_KR.UTF-8,
     an application running under ko_KR.UTF-8 cannot send
     the character to an application running under ko_KR.eucKR
     locale because U+4E00 would be encoded as

        ESC $ ( B 30 6C  ( 0x30 0x6C : U+4E00 in JIS X 0208 GL)

     instead of

        ESC $ ( C 6C 69  ( 0x6C 0x69 : U+4E00 in KS C 5601 GL)

     Of course, this would not
     be an issue if UTF8_STRING were used. However, I don't
     know how to get XIM servers to try UTF8_STRING before
     falling back to COMPOUND_TEXT (I've been modifying Ami,
     a Korean XIM server, to work in ko_KR.UTF-8).

     BTW, I think this would also be an issue for locales like
     hu_HU. ISO-8859-2 should be listed before ISO-8859-1:GR in
     hu_HU.UTF-8 to avoid losing characters that are in both
     ISO-8859-1:GR and ISO-8859-2 when cutting and pasting from an
     application running under hu_HU.UTF-8 to one under hu_HU.ISO8859-2.

     I don't know if there's any standard saying that UTF-8 should be
     considered the last resort in making up CompoundText.
     While testing my patch to make the Korean input
     method server Ami work in the ko_KR.UTF-8 locale, I found that
     UTF-8 is not used in CompoundText encoding unless it's absolutely
     necessary, even if iso10646-1 is the first entry in XLC_LOCALE.
     This is rather nice!!  (The patch is more or less complete, and
     I can now use it to enter the full set of Korean syllables in
     Unicode. It is at http://jshin.net/faq/ami-1.0.11.utf8.patch.gz)
     For example, U+AC02 (Hangul Syllable GAGG) is not in KS C 5601
     while U+AC00 is. When I type U+AC00 and U+AC02
     in succession, Ami (modified to work under the ko_KR.UTF-8
     locale) sends the following compound text string to a client.

        ESC $ ( C 30 21 ESC % @ ESC % G EA B0 82 ESC % @

     ( where '30 21' is for U+AC00 in KS C 5601 GL and 'EA B0 82'
       is for U+AC02 in UTF-8)

  2. ISO-8859-x's other than ISO-8859-1 and other single-byte character
     sets should be placed before (or after, depending on user preference)
     any multi-byte character sets to work around a width problem.
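
The effect of charset ordering in item 1 can be sketched in a few
lines. This is only a toy model, not Xlib's converter: it assumes
Python's "euc_jp", "euc_kr" and "gb2312" codecs as stand-ins for the
94x94 sets, uses the GL designations from the X Compound Text
specification (ESC $ ( A, ESC $ ( B, ESC $ ( C), and falls back to a
UTF-8 segment (ESC % G ... ESC % @) only when no listed set has the
character. The names CHARSETS, to_ctext and from_ctext are made up
for this example.

```python
# Toy model of "first matching charset wins" when building CompoundText.
# Codec names are stand-ins (an assumption), not what Xlib uses internally.
CHARSETS = {
    "gb2312.1980-0":   ("gb2312", b"\x1b$(A"),
    "jisx0208.1983-0": ("euc_jp", b"\x1b$(B"),
    "ksc5601.1987-0":  ("euc_kr", b"\x1b$(C"),
}
UTF8_START, UTF8_END = b"\x1b%G", b"\x1b%@"


def gl_code(ch, codec):
    """Return the 2-byte GL code of ch in the set behind codec, or None."""
    try:
        raw = ch.encode(codec)
    except UnicodeEncodeError:
        return None
    if len(raw) != 2 or min(raw) < 0xA1:
        return None  # not a plain 94x94 code point
    return bytes(b & 0x7F for b in raw)


def to_ctext(ch, order):
    """Encode one character using the first charset in order that has it."""
    for name in order:
        codec, designate = CHARSETS[name]
        code = gl_code(ch, codec)
        if code is not None:
            return designate + code
    return UTF8_START + ch.encode("utf-8") + UTF8_END


def from_ctext(data):
    """Decode a byte string made of the segments modelled above."""
    out, i, codec, utf8 = [], 0, None, False
    while i < len(data):
        if data.startswith(UTF8_START, i):
            utf8, i = True, i + 3
        elif data.startswith(UTF8_END, i):
            utf8, i = False, i + 3
        elif data[i] == 0x1B:
            for cand, designate in CHARSETS.values():
                if data.startswith(designate, i):
                    codec, i = cand, i + len(designate)
                    break
            else:
                raise ValueError("unknown escape sequence")
        elif utf8:
            j = data.find(b"\x1b", i)
            j = len(data) if j < 0 else j
            out.append(data[i:j].decode("utf-8"))
            i = j
        else:
            if codec is None:
                raise ValueError("no charset designated")
            out.append(bytes(b | 0x80 for b in data[i:i + 2]).decode(codec))
            i += 2
    return "".join(out)


jis_first = to_ctext("\u4e00", ["jisx0208.1983-0", "ksc5601.1987-0"])
ksc_first = to_ctext("\u4e00", ["ksc5601.1987-0", "jisx0208.1983-0"])
print(jis_first.hex(" "))  # segment labeled as JIS X 0208
print(ksc_first.hex(" "))  # segment labeled as KS C 5601

# The string Ami sent for U+AC00 U+AC02, as quoted in item 1.
ami = bytes.fromhex("1b 24 28 43 30 21 1b 25 40 1b 25 47 ea b0 82 1b 25 40")
print([hex(ord(c)) for c in from_ctext(ami)])
```

A receiver in ko_KR.eucKR only understands the KS C 5601 segment,
which is why the order in XLC_LOCALE decides whether the character
survives the transfer.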

Now, back to specifying the 'lang' code in the add-style field
of XLFD: I have the following lines in fonts.dir in the directory
where the Korean Baekmuk TrueType fonts are installed.


---------
gulim.ttf  -baekmuk-gulim-medium-r-normal-ko-0-0-0-0-c-0-iso10646-1
gulim.ttf  -baekmuk-gulim-medium-r-normal-ko-0-0-0-0-p-0-iso10646-1
batang.ttf  -baekmuk-batang-medium-r-normal-ko-0-0-0-0-c-0-iso10646-1
batang.ttf  -baekmuk-batang-medium-r-normal-ko-0-0-0-0-p-0-iso10646-1
gulim.ttf  -baekmuk-gulim-medium-r-normal-ko-0-0-0-0-c-0-ksc5601.1992-3
gulim.ttf  -baekmuk-gulim-medium-r-normal-ko-0-0-0-0-p-0-ksc5601.1992-3
batang.ttf -baekmuk-batang-medium-r-normal-ko-0-0-0-0-c-0-ksc5601.1992-3
batang.ttf -baekmuk-batang-medium-r-normal-ko-0-0-0-0-p-0-ksc5601.1992-3
.......similar lines for ksc5601.1987-0
----------

With the gtkrc.ko_KR.utf8 shown below, GNOME applications work pretty
well under the ko_KR.UTF-8 locale.

---------- /etc/gtk/gtkrc.ko_KR.utf8
style "gtk-default-ko-kr-utf8" {
   fontset = "-*-*-medium-r-normal-ko-14-*-*-*-p-*-iso10646-1,\
              -*-*-medium-r-normal-*-14-*-*-*-*-*-*-*"
}
class "GtkWidget" style "gtk-default-ko-kr-utf8"
---------

--------  ~/ko_KR.UTF-8/Xedit
*fontSet: \
 -*-*-medium-r-normal-ko-14-*-*-*-p-*-iso10646-1,\
 -*-*-medium-r-normal--14-*-*-*-*-*-iso10646-1,\
 -*-*-medium-r-normal--14-*-*-*-*-*-*-*
*international: True
*inputMethod: Ami
---------

If I use '*' in place of 'ko', I just have to pray that a Korean
iso10646-1 font is picked up.
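
Here is a rough sketch of why the 'ko' label helps. It is a
simplified field-by-field matcher, not the X server's real matching
code: it assumes 14-field XLFD names with no extra hyphens, and it
treats a 0 in the size fields as "scalable, matches any size". The
second font name is hypothetical, made up to stand for some other
CJK iso10646-1 font on the same server.

```python
# Simplified XLFD matching to show the role of the add-style field.
XLFD_FIELDS = [
    "foundry", "family", "weight", "slant", "setwidth", "add_style",
    "pixel_size", "point_size", "res_x", "res_y", "spacing",
    "avg_width", "registry", "encoding",
]
# For scalable fonts these fields are 0 in fonts.dir and match any request.
SIZE_FIELDS = {"pixel_size", "point_size", "res_x", "res_y", "avg_width"}


def parse_xlfd(name):
    """Split an XLFD name or pattern into its 14 named fields."""
    parts = name.split("-")
    if len(parts) != 15 or parts[0] != "":
        raise ValueError("not a 14-field XLFD: " + name)
    return dict(zip(XLFD_FIELDS, parts[1:]))


def matches(pattern, font):
    """True if font satisfies pattern, comparing field by field."""
    pat, fnt = parse_xlfd(pattern), parse_xlfd(font)
    for field in XLFD_FIELDS:
        if pat[field] == "*":
            continue
        if field in SIZE_FIELDS and fnt[field] == "0":
            continue
        if pat[field].lower() != fnt[field].lower():
            return False
    return True


fonts = [
    # from the fonts.dir listing above
    "-baekmuk-gulim-medium-r-normal-ko-0-0-0-0-p-0-iso10646-1",
    # hypothetical Japanese font, for illustration only
    "-example-mincho-medium-r-normal-ja-0-0-0-0-p-0-iso10646-1",
]

with_lang = "-*-*-medium-r-normal-ko-14-*-*-*-p-*-iso10646-1"
without_lang = "-*-*-medium-r-normal-*-14-*-*-*-p-*-iso10646-1"

ko_only = [f for f in fonts if matches(with_lang, f)]
anything = [f for f in fonts if matches(without_lang, f)]
print(len(ko_only), "font(s) match the 'ko' pattern")
print(len(anything), "font(s) match the '*' pattern")
```

With '*' both fonts qualify, so which one gets used depends on
whatever order the server happens to return them in.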


Summing up, I have two suggestions:

  1. In XFree86, XLC_LOCALE files for ll_CC.UTF-8 have to
     be *tailored* for each ll_CC so that CompoundText
     works between applications running under ll_CC.Legacy
     and ll_CC.UTF-8.

  2. To work around a nasty width problem, we have to take
     advantage of the 'add-style' field for iso10646-1 fonts
     to specify the 'lang' of fonts. This should be done
     on a few fronts in cooperation: font developers/package
     builders and application/desktop developers.

  Jungshik Shin


--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/