Yves Dorfsman writes:
>
>On Thu, 24 Aug 2000, Xianping Ge wrote:
>
>> I have a working patch against rxvt-2.7.3 to enable
>> multibyte-char, e.g., Unicode (UTF-8), GBK, Big5, etc:
>>
>> http://www.ics.uci.edu/~xge/clinux/rxvt/
>
>Thanks, I'll have a look.
>
>I thought Big5 was already supported. I have crxvt 2.6.2 that comes
>standard with Debian, and it seems to work fine (I can read files, that I
>have created with cxterm-big5).
Yes, Big5 was already supported, and very stable (I had been using it before
my 'multibyte-char' patch). My patch attempts to unify the code for
(1) ASCII (or other single-byte encodings) only,
(2) Big5, GBK, etc.,
(3) UTF-8, which was not supported by rxvt,
and here's my motivation:
1. The current implementation of Big5 support (#define MULTICHAR_SET) uses
different code for (1) ASCII only, and (2) Big5 etc.
2. "MULTICHAR_SET" hard-coded the fact that in encodings like Big5,
a non-ASCII character has exactly 2 bytes, and is exactly 2 columns wide.
This is almost impossible to extend to accommodate UTF-8.
3. So, instead of yet another separate code block for UTF-8, I'd rather
have code general enough to handle (1) ASCII/single-byte only,
(2) Big5 etc, and (3) UTF-8 uniformly. The basic idea is:
- multi-byte char:
Represent each character using a wide char (u_int_8, or u_int_16,
or u_int_32) big enough to contain all the characters in the
encoding.
- multi-column:
On screen, the width of characters can be 1 column, 2 columns, ...
This representation differs from the current representation in rxvt, where
a byte is synonymous with a column: one byte in the display buffer is exactly
one column on screen.
4. Currently, I am using
#ifdef MULTIBYTE_CHAR
... my modified code ...
#else
... original code ...
#endif
to separate my modification from original code in rxvt. Because my
modified code covers the case of ASCII/single-byte, I suggest simply
replacing the whole '#ifdef...#else...#endif' with
'... my modified code ...', if I can convince other people to change
to my 'multi-byte, multi-column' representation (from one-byte =
one-column representation).
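
As a rough illustration of the 'multi-byte, multi-column' idea in item 3,
here is a minimal sketch in C. The names (wide_char_t, struct cell,
run_columns) are hypothetical and are not taken from the patch; the sketch
only assumes that each character is stored whole, together with the number
of columns its glyph occupies.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names for illustration; not taken from the patch. */
typedef uint32_t wide_char_t;   /* wide enough for any Unicode code point */

struct cell {
    wide_char_t ch;    /* one whole character, however many bytes it had on input */
    uint8_t     cols;  /* columns the glyph occupies on screen: 1, 2, ... */
};

/* Total on-screen width of a run of characters.  Under the old
 * "one byte = one column" model this would simply be the byte count. */
size_t run_columns(const struct cell *run, size_t nchars)
{
    size_t total = 0;
    size_t i;

    assert(run != NULL || nchars == 0);
    for (i = 0; i < nchars; i++)
        total += run[i].cols;
    return total;
}
```

With this layout, a run holding 'a', one Chinese character, and 'b' has three
cells but four columns, and nothing in the buffer depends on how many bytes
the input encoding used.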
>
>> The most URGENT thing on the to-do list is to write
>> some real code to classify a Unicode character as single-width (e.g., ASCII)
>> or double-width (e.g., Chinese characters).
>
>I might be mistaking here, but my understanding is that the chinese
>characters are coded on three bytes in UTF-8.
Here the width refers to the width of the glyph (one column or two columns);
so, on the screen, a Chinese character is as wide as two ASCII characters.
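
To keep the two notions apart, here is a small sketch in C that looks only at
the byte side: how many bytes a UTF-8 sequence has, judged from its lead byte.
The helper names are hypothetical, and the code deliberately does not validate
continuation bytes, overlong forms, or out-of-range lead bytes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: number of bytes in the UTF-8 sequence that starts
 * with `lead`, or 0 if `lead` cannot start a sequence. */
size_t utf8_seq_len(unsigned char lead)
{
    if (lead < 0x80)           return 1;  /* 0xxxxxxx: ASCII */
    if ((lead & 0xE0) == 0xC0) return 2;  /* 110xxxxx */
    if ((lead & 0xF0) == 0xE0) return 3;  /* 1110xxxx: most Chinese characters */
    if ((lead & 0xF8) == 0xF0) return 4;  /* 11110xxx */
    return 0;                             /* continuation byte or invalid */
}

/* Count characters (not bytes) in a NUL-terminated UTF-8 string.
 * Stops and returns the count so far if a byte cannot start a sequence. */
size_t utf8_char_count(const unsigned char *s)
{
    size_t n = 0;

    assert(s != NULL);
    while (*s != '\0') {
        size_t len = utf8_seq_len(*s);
        size_t i;

        if (len == 0)
            break;
        /* Skip the sequence, but never step past the terminator. */
        for (i = 0; i < len && *s != '\0'; i++)
            s++;
        n++;
    }
    return n;
}
```

For the bytes 'a', 0xE4, 0xB8, 0xAD, 'b' this counts three characters in five
bytes; the middle one (U+4E2D) takes three bytes in UTF-8 yet is drawn two
columns wide, which is why byte count and column count have to be tracked
separately.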
>> Currently, I make everything > 255
>> as double-width, which is obviously wrong. Need to read through Unicode
>> documents to get this right. When this is done (and most bugs removed), I'll
>> propose the patch to be merged.
>
>Anything I can help with, or do you have most everything under control ?
Tomohiro KUBOTA <[EMAIL PROTECTED]> pointed me to my_wcwidth() in xterm-utf8.
It uses Markus Kuhn's implementation of the wcwidth() function
( http://www.cl.cam.ac.uk/~mgk25/ucs/wcwidth.c ). So the width problem
should soon be solved.
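
To show the kind of classification involved, here is a simplified sketch in
C. It is not Markus Kuhn's code and not the code in the patch: the function
name cell_width() and the short range table are my own assumptions, the table
covers only a handful of East Asian wide ranges, and combining or control
characters (which a real wcwidth() treats specially) are ignored.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct range { uint32_t first, last; };

/* A few double-width ranges, for illustration only; far from complete. */
static const struct range wide_ranges[] = {
    { 0x1100, 0x115F },  /* Hangul Jamo initial consonants */
    { 0x3040, 0x30FF },  /* Hiragana, Katakana */
    { 0x4E00, 0x9FFF },  /* CJK Unified Ideographs */
    { 0xAC00, 0xD7A3 },  /* Hangul Syllables */
    { 0xF900, 0xFAFF },  /* CJK Compatibility Ideographs */
    { 0xFF01, 0xFF60 },  /* Fullwidth forms */
};

/* Hypothetical classifier: 2 for code points in the table above,
 * 1 for everything else. */
int cell_width(uint32_t ucs)
{
    size_t i;

    assert(ucs <= 0x10FFFF);  /* must be a Unicode code point */
    for (i = 0; i < sizeof wide_ranges / sizeof wide_ranges[0]; i++) {
        if (ucs >= wide_ranges[i].first && ucs <= wide_ranges[i].last)
            return 2;
    }
    return 1;
}
```

Unlike the "everything > 255 is double-width" rule, this reports a Cyrillic
letter such as U+0416 as one column wide while still reporting U+4E2D as two.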
Another item on the to-do list is IM (input method) support; I suppose existing
IM code (for MULTICHAR_SET) can be readily re-used.
-- Xianping
[EMAIL PROTECTED]