Tomohiro KUBOTA wrote on 2001-04-10 01:26 UTC:
> Sure, IMO, character width problem and accompanying conversion table
> problems are one of most urgent problems if Japanese people think of
> using Unicode for serious daily business, not for Geek's hobby. In
> "legacy" CJK encodings, width of each character is very cleary
Yes, in legacy CJK encodings, the width of many characters is very
clearly completely inappropriate. I can't imagine that CJK users really
want to have square Greek and Cyrillic characters and similar nonsense.
It is not a flaw of Unicode that it lets us select character widths
properly, rather than having them determined by the number of bytes
used in the multi-byte encoding.
The only characters for which double-width (square) is appropriate are
- Han ideographs
- Hiragana/Katakana
- Hangul
- CJK punctuation
- fullwidth forms
and all of these are double-width in the currently proposed wcwidth scheme:
    return 1 +
      (ucs >= 0x1100 &&
       (ucs <= 0x115f ||                    /* Hangul Jamo init. consonants */
        (ucs >= 0x2e80 && ucs <= 0xa4cf && (ucs & ~0x0011) != 0x300a &&
         ucs != 0x303f) ||                  /* CJK ... Yi */
        (ucs >= 0xac00 && ucs <= 0xd7a3) || /* Hangul Syllables */
        (ucs >= 0xf900 && ucs <= 0xfaff) || /* CJK Compatibility Ideographs */
        (ucs >= 0xfe30 && ucs <= 0xfe6f) || /* CJK Compatibility Forms */
        (ucs >= 0xff00 && ucs <= 0xff5f) || /* Fullwidth Forms */
        (ucs >= 0xffe0 && ucs <= 0xffe6) ||
        (ucs >= 0x20000 && ucs <= 0x2ffff)));
http://www.cl.cam.ac.uk/~mgk25/ucs/wcwidth.c
I think it is very straightforward and unambiguous. Where is the
problem?
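For illustration only, here is a minimal sketch of how that expression
could be wrapped into a function and applied to a string of code points.
The names cjk_width_sketch and cjk_string_width are hypothetical, and the
sketch ignores control and combining characters, which a complete wcwidth
has to treat separately:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical wrapper around the expression quoted above:
 * returns 2 for the double-width ranges, 1 for everything else.
 * Control and combining characters are NOT handled here. */
int cjk_width_sketch(uint32_t ucs)
{
    return 1 +
      (ucs >= 0x1100 &&
       (ucs <= 0x115f ||                    /* Hangul Jamo init. consonants */
        (ucs >= 0x2e80 && ucs <= 0xa4cf && (ucs & ~0x0011u) != 0x300a &&
         ucs != 0x303f) ||                  /* CJK ... Yi */
        (ucs >= 0xac00 && ucs <= 0xd7a3) || /* Hangul Syllables */
        (ucs >= 0xf900 && ucs <= 0xfaff) || /* CJK Compatibility Ideographs */
        (ucs >= 0xfe30 && ucs <= 0xfe6f) || /* CJK Compatibility Forms */
        (ucs >= 0xff00 && ucs <= 0xff5f) || /* Fullwidth Forms */
        (ucs >= 0xffe0 && ucs <= 0xffe6) ||
        (ucs >= 0x20000 && ucs <= 0x2ffff)));
}

/* Hypothetical helper: number of terminal columns occupied by
 * n code points, obtained by summing the per-character widths. */
int cjk_string_width(const uint32_t *s, size_t n)
{
    int columns = 0;
    for (size_t i = 0; i < n; i++)
        columns += cjk_width_sketch(s[i]);
    return columns;
}
```

Under this rule a Greek alpha (U+03B1), a Cyrillic letter (U+0430) or a
box-drawing line (U+2500) takes one column, while a Han ideograph such
as U+4E00 or a fullwidth letter such as U+FF21 takes two.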
I understand that there are also the block graphics characters. If you
live in a world where you use mostly double-width glyphs in terminal
emulators, it might be convenient to also have double-width block
graphics characters. The answer of the Unicode consortium is very simple
here: Nobody should be using the block graphics characters anyway. Their
use is deprecated, and they are only in Unicode to guarantee round-trip
compatibility with legacy sets. In modern display systems (such as
HTML), you have appropriate alternative means such as table constructs
to do what you used to use block graphics for.
[There is also the minor issue of U+300a, U+300b, U+301a, U+301b, four
mathematical brackets that somehow ended up in the CJK section, probably
because the Unicode/UCS authors were initially not aware of the
mathematical origin of these double-stroke parentheses and brackets. Do
these have any non-mathematical use whatsoever in CJK typography?
If not, they clearly must be normal width.]
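The test (ucs & ~0x0011) != 0x300a in the expression is what keeps
exactly these four code points single-width. As a small sketch (the
helper name is_excluded_bracket is hypothetical), the mask can be
checked in isolation:

```c
#include <stdint.h>

/* Hypothetical helper: returns 1 for exactly U+300A, U+300B,
 * U+301A and U+301B, and 0 for every other code point.
 * ~0x0011 clears bit 0 and bit 4, so these four values, which
 * differ only in those two bits, all collapse to 0x300a. */
int is_excluded_bracket(uint32_t ucs)
{
    return (ucs & ~0x0011u) == 0x300a;
}
```

Neighbouring CJK punctuation such as U+3009 or U+300C does not match
the mask and therefore stays double-width.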
The ultimate solution to the width problem that some CJK people might
see is to avoid characters, such as the block graphics symbols, whose
width matters. If you want to draw a line in a text, use proper
graphical primitives for that, not block graphics symbols. Some beloved
habits will have to change; there is no way around that.
So I don't see any further work that has to be done here to address any
of the problems that you mention. You will have to get used to the fact
that plain-text layout is going to change after EUC->UTF-8 conversion.
This is not a bad thing, but authors of formatted CJK plain text will
have to take it into consideration in the future.
Markus
--
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org, WWW: <http://www.cl.cam.ac.uk/~mgk25/>
-
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/lists/