Re: ASCII and JIS X 0201 Roman - the backslash problem

Jungshik Shin Sun, 12 May 2002 00:42:37 -0700

On Sat, 11 May 2002, Markus Kuhn wrote:

> I have found some ways of lobbying for specific technical issues
> within Microsoft and sometimes manage to get directly in contact
....
> I'd be happy to add the yen/backslash issue to this list.


  Exactly the same problem exists for Korean Won/backslash.  KS X 1003
(ISO 646-KR) has Won at 0x5c just like JIS X 0201 (ISO 646-JP) has
Yen at 0x5c.  MS fonts for Korean have WonSign at U+005C instead of
backslash, which really annoys MS-Windows TeX users, among others.
Those MS fonts have another (half-width) Korean Won sign at U+20A9 as
well as a full-width sign at U+FFE6. Given these, it may help you get
your suggestion across to MS people if you take up the two problems
'at a single stroke'. (Looking into SimSun and MingLiu for SC and TC,
I found that zh locales don't have this problem.)
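
  To keep the code points straight, here is a small Python sketch that
prints the Unicode names of the characters discussed in this thread.
U+FFE5 (the fullwidth Yen) is not mentioned above; I add it only for
symmetry with U+FFE6. The helper name describe() is made up for this
illustration.

```python
import unicodedata

# Code points discussed in this thread. U+FFE5 (fullwidth Yen) is added
# here for symmetry with U+FFE6; it is not named in the text above.
CODE_POINTS = [0x005C, 0x00A5, 0x20A9, 0xFFE5, 0xFFE6]


def describe(cp: int) -> str:
    """Return 'U+XXXX NAME' for a single code point."""
    return f"U+{cp:04X} {unicodedata.name(chr(cp))}"


if __name__ == "__main__":
    for cp in CODE_POINTS:
        print(describe(cp))
```

The point is only that backslash, Yen and Won are three distinct
characters in Unicode, whatever glyph a given font draws at U+005C.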

  BTW, the ko_KR locale definition in glibc 2.2.x has to use
U+FFE6 in LC_MONETARY because KS X 1001, used in EUC-KR, doesn't have a
character corresponding to U+20A9. Of course, we wouldn't have
to if everybody used UTF-8 locales exclusively.
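
  One way to check this sort of thing is to ask a codec directly. The
sketch below uses Python's 'euc_kr' codec as a stand-in for the EUC-KR
charset; that is an assumption, since codec tables do not always match
a standard exactly. The helper encodable() is a made-up name.

```python
def encodable(ch: str, encoding: str) -> bool:
    """Return True if Python's codec for `encoding` can encode `ch`."""
    try:
        ch.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


if __name__ == "__main__":
    # Compare the halfwidth and fullwidth Won signs against EUC-KR.
    for cp in (0x20A9, 0xFFE6):
        print(f"U+{cp:04X} encodable in euc_kr: {encodable(chr(cp), 'euc_kr')}")
```

I'd expect it to report that only the fullwidth sign is encodable, but
run it against whatever codec you actually care about.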


> However, I will need someone who writes me a detailed report and
> analysis of this issue and presents a well-formulated case for
> why current practice is wrong, what exactly should be changed,

  I'm sorry I can't help you much with this because I know
only as much about the Japanese situation as you do. However, I can give
you a suggestion as to how to solve this problem for Japanese and Korean
keyboards/IMEs. I'm not saying this is all that has to be done, but this
can be a part of what needs to be done. Japanese and Korean users have to
switch between Japanese (Korean) input mode and English input mode (think
of them as two keyboard groups in Xkb). In English input mode, the key
labelled with 'vertical bar and Yen (Won)' should produce backslash
(U+005C), whereas in Japanese (Korean) input mode it should generate
Yen (U+00A5) or Won (U+20A9), respectively. Japanese and Korean keyboards
for new computers should have *three* characters marked on that key
(perhaps with the Yen and Won signs in a different color from the other
two characters to indicate that they can only be entered in Japanese and
Korean input mode). Japanese and Korean IMEs also have a full-width mode,
in which pressing the key should produce fullwidth Yen and Won. (In this
scheme, the fullwidth backslash can't be entered, but who needs it? If
one really wants to enter it, one can use the codemap or something like
that.)  It may take some getting used to on the part of Japanese and
Korean users who got used to embedding the directory separator between
Japanese and Korean path names, but not many people do that under
Windows (they just drag'n'drop, click, etc...)
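
  The proposed key behaviour can be written down as a small lookup
table. This is only a model of the suggestion above, not any real IME
or Xkb interface; the layout names, mode names and key_output() are all
hypothetical, and U+FFE5 for the fullwidth Yen is my addition.

```python
# Hypothetical model of the proposal: what the 'vertical bar and
# Yen (Won)' key should produce in each input mode.
KEY_OUTPUT = {
    ("ja", "english"): "\u005c",    # backslash
    ("ja", "native"): "\u00a5",     # Yen sign
    ("ja", "fullwidth"): "\uffe5",  # fullwidth Yen sign
    ("ko", "english"): "\u005c",    # backslash
    ("ko", "native"): "\u20a9",     # Won sign
    ("ko", "fullwidth"): "\uffe6",  # fullwidth Won sign
}


def key_output(layout: str, mode: str) -> str:
    """Return the character the key should generate for a layout/mode pair."""
    return KEY_OUTPUT[(layout, mode)]


if __name__ == "__main__":
    for (layout, mode), ch in KEY_OUTPUT.items():
        print(f"{layout}/{mode}: U+{ord(ch):04X}")
```

Note that no entry produces the fullwidth backslash, which matches the
trade-off mentioned in the parenthesis above.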

  Now somebody might raise an objection to this because Shift_JIS and
CP949 (extension of EUC-KR used in MS-Windows) don't have U+00A5 and
U+20A9. They'd say that with this change, all of a sudden emails and
html files in those encodings can't include Yen and Won. For html files,
this is not a valid objection because no matter what encoding is used,
one can always use NCRs to include any character in Unicode.  Web
authoring tools should take care of this problem. Simple text editors
should warn users that files to be saved in Shift_JIS or CP949 include
characters not representable in those encodings and that they're about
to be replaced with something like '\u00a5' (or '\u20a9'). For emails in
plain text in legacy encodings, they can use the fullwidth Yen and Won
(a smart email program would do that if users insist on using Shift_JIS
and CP949/EUC-KR. Otherwise, it can send emails in UTF-8).
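
  A rough Python sketch of the two fallbacks just described: NCRs for
html and fullwidth signs for plain-text mail. The function names are
made up for illustration. I map the two characters explicitly rather
than relying on a codec's error handler, on the assumption that legacy
codec tables may not agree on how they treat U+00A5.

```python
# NCRs for html output: U+00A5 is decimal 165, U+20A9 is decimal 8361.
NCR = {0x00A5: "&#165;", 0x20A9: "&#8361;"}

# Fullwidth substitutes for plain-text mail in a legacy encoding.
FULLWIDTH = {0x00A5: 0xFFE5, 0x20A9: 0xFFE6}


def for_html(text: str) -> str:
    """Replace Yen/Won signs with numeric character references."""
    return text.translate(NCR)


def for_plain_text(text: str) -> str:
    """Replace Yen/Won signs with their fullwidth counterparts."""
    return text.translate(FULLWIDTH)


def needs_warning(text: str) -> bool:
    """True if a text editor should warn before saving in a legacy encoding."""
    return any(ord(ch) in NCR for ch in text)


if __name__ == "__main__":
    sample = "\u00a5100 / \u20a91000"
    print(for_html(sample))
    print(for_plain_text(sample))
    print(needs_warning(sample))
```

After either substitution the text contains only characters that the
legacy charsets are supposed to have, so the usual encoding step can
follow.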

  As for existing web pages and documents, I don't know
what the best solution is except that as time goes by people will
gradually convert them as necessary. It'll be great if they go all the
way to UTF-8 (or other UTFs). If not, at least they can use NCRs in html
files. As others have written, this conversion needs some form of
'AI'(?), but I guess there are not many documents in which '0x5c'
doubles as the directory separator (or escape character) and Yen/Won.
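
  To give an idea of what the crudest form of such 'AI' might look
like: treat a 0x5c that directly precedes a digit as a currency sign
and leave every other one alone. This is purely my assumption about a
workable rule, not something anybody has proposed here, and it will
misfire on paths like C:\2002 or escapes like \0. guess_currency() is
a made-up name.

```python
import re


def guess_currency(text: str, sign: str = "\u00a5") -> str:
    """Replace a backslash that directly precedes a digit with `sign`.

    Crude heuristic: all other backslashes (path separators, most
    escape sequences) are left untouched.
    """
    return re.sub(r"\\(?=\d)", sign, text)


if __name__ == "__main__":
    print(guess_currency("Price: \\100"))
    print(guess_currency("C:\\temp\\file.txt"))
    print(guess_currency("Price: \\1000", sign="\u20a9"))
```

A real converter would need to look at much more context than this,
and probably ask the user in doubtful cases.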

  I'm not sure whether this will help the transition or not. However,
as an interim measure, MS and foundries could make TTCs (TrueType
collections) for Japanese and Korean have two variants, one with backslash
at U+005C and the other with Yen/Won at U+005C.  It would increase the
size of a TTC by only ~100 bytes (well, it could be a few kB, but it
doesn't really matter because Japanese and Korean TTCs are usually well
over 1MB) because the two variants share all the glyphs except for the
one for U+005C. Alternatively, a TrueType GSUB(?) table entry for U+005C
could be used. Perhaps, to 'promote' the transition, the default should
be the one with backslash at U+005C and the other variant should have a
special marker attached to its name (say, '$'). This convention is not
my invention. '@' at the beginning of Korean font names (and I believe
this is also the case for Japanese fonts) denotes variants for vertical
writing.  I don't know whether this is just a convention used in
MS-Windows or is a part of the TrueType or OpenType spec (the latter is
not likely).

  Before sending this off, I'm gonna add another data point on this
issue. Some Korean Unix/Linux users got used to interpreting 'backslash'
as Won when viewing Korean web pages because most, if not all, X11 BDF
fonts (and some TrueType fonts made for Unix/Linux users) have
backslashes at 0x5c (well, in the case of BDF fonts, they're iso-8859-x
fonts, so they should). This is the opposite of what TeX users under
MS-Windows got used to. However, this may not be the case for Japanese
Unix/Linux users because there are JIS X 0201 fonts (there's no
KS X 1003 font).

   Hope this helps you a little with raising the issue with MS,

   Jungshik

--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
