Re: Unicode

Xianping Ge Thu, 24 Aug 2000 22:19:10 -0700
Yves Dorfsman writes:
>
>On Thu, 24 Aug 2000, Xianping Ge wrote:
>
>> I have a working patch against rxvt-2.7.3 to enable
>> multibyte-char, e.g., Unicode (UTF-8), GBK, Big5, etc:
>> 
>>     http://www.ics.uci.edu/~xge/clinux/rxvt/
>
>Thanks, I'll have a look.
>
>I thought Big5 was already supported. I have crxvt 2.6.2 that comes
>standard with Debian, and it seems to work fine (I can read files that I
>have created with cxterm-big5).

Yes, Big5 was already supported, and very stable (I had been using it before
my 'multibyte-char' patch). My patch attempts to unify the code for 
  (1) ASCII (or other single-byte encodings) only, 
  (2) Big5, GBK, etc; 
  (3) UTF-8, which was not supported by rxvt,
and here's my motivation:
  1. The current implementation of Big5 support (#define MULTICHAR_SET) uses
     different code for (1) ASCII only, and (2) Big5 etc.
  2. "MULTICHAR_SET" hard-codes the assumption that, in encodings like Big5,
     a non-ASCII character has exactly 2 bytes, and is exactly 2 columns wide.
     This is almost impossible to extend to accommodate UTF-8.
  3. So, instead of yet another separate code block for UTF-8, I'd rather
     have code general enough to handle (1) ASCII/single-byte only,
     (2) Big5 etc, and (3) UTF-8 uniformly. The basic idea is:
         - multi-byte char:
           Represent each character using a wide char (u_int8_t, u_int16_t,
           or u_int32_t) big enough to contain all the characters in the
           encoding.
         - multi-column:
           On screen, the width of characters can be 1 column, 2 columns, ...
     This representation differs from the current representation in rxvt,
     where a byte is synonymous with a column: one byte in the display buffer
     is exactly one column on screen.
  4. Currently, I am using 
          #ifdef MULTIBYTE_CHAR
             ... my modified code ...
          #else
             ... original code ...
          #endif
     to separate my modifications from the original code in rxvt. Because my
     modified code also covers the ASCII/single-byte case, I suggest simply
     replacing the whole '#ifdef...#else...#endif' with
     '... my modified code ...', if I can convince other people to switch
     from the 'one byte = one column' representation to my
     'multi-byte, multi-column' representation.
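
To make the 'multi-byte, multi-column' idea in item 3 concrete, here is a
minimal sketch. It is not taken from the patch: the names cell_t,
line_columns() and cell_at_column() are hypothetical, and it uses the
<stdint.h> types rather than rxvt's own typedefs. Each cell holds one
character plus its width in columns, so a screen column has to be mapped back
to a cell instead of being used directly as a buffer index.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical cell type: one entry per character, not per byte or column. */
typedef struct {
    uint32_t ch;     /* character code, wide enough for the whole encoding */
    uint8_t  width;  /* columns occupied on screen: 1, 2, ... */
} cell_t;

/* Total number of screen columns used by the first n cells of a line. */
static size_t line_columns(const cell_t *line, size_t n)
{
    size_t cols = 0;
    for (size_t i = 0; i < n; i++)
        cols += line[i].width;
    return cols;
}

/* Index of the cell covering screen column col, or n if col is past the end. */
static size_t cell_at_column(const cell_t *line, size_t n, size_t col)
{
    size_t start = 0;  /* first column of the current cell */
    for (size_t i = 0; i < n; i++) {
        if (col < start + line[i].width)
            return i;
        start += line[i].width;
    }
    return n;
}
```

With a line holding 'a', one double-width character, and 'b', the line is 4
columns wide, and columns 1 and 2 both map to the same cell. That lookup is
what the one-byte = one-column buffer gets for free and this representation
has to compute.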
    
>
>> The most URGENT thing on the to-do list is to write
>> some real code to classify a Unicode character as single-width (e.g., ASCII)
>> or double-width (e.g., Chinese characters).
>
>I might be mistaken here, but my understanding is that Chinese
>characters are coded on three bytes in UTF-8.

Here the width refers to the width of the glyph (one column or two columns);
so, on the screen, a Chinese character is as wide as two ASCII characters.
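
The byte count and the column count are independent, which a small decoder
makes visible. The following is only a sketch under my own assumptions
(utf8_decode() is a hypothetical name, not a function in rxvt or in the
patch, and it skips the overlong-form and surrogate checks a real decoder
needs). It turns one UTF-8 sequence into a single wide character and reports
how many bytes it consumed.

```c
#include <stddef.h>
#include <stdint.h>

/* Decode one UTF-8 sequence from s (at most len bytes) into *out.
 * Returns the number of bytes consumed (1 to 4), or 0 if the input is
 * malformed or truncated.
 * Simplified: overlong forms and surrogates are not rejected. */
static size_t utf8_decode(const unsigned char *s, size_t len, uint32_t *out)
{
    size_t need;
    uint32_t cp;

    if (len == 0)
        return 0;
    if (s[0] < 0x80)                { cp = s[0];        need = 1; }
    else if ((s[0] & 0xE0) == 0xC0) { cp = s[0] & 0x1F; need = 2; }
    else if ((s[0] & 0xF0) == 0xE0) { cp = s[0] & 0x0F; need = 3; }
    else if ((s[0] & 0xF8) == 0xF0) { cp = s[0] & 0x07; need = 4; }
    else
        return 0;  /* stray continuation byte or invalid lead byte */

    if (len < need)
        return 0;  /* sequence is cut off */
    for (size_t i = 1; i < need; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;  /* not a continuation byte */
        cp = (cp << 6) | (s[i] & 0x3F);
    }
    *out = cp;
    return need;
}
```

For the bytes E4 B8 AD this consumes 3 bytes and yields U+4E2D, a single
character that is then drawn 2 columns wide. So "three bytes" describes the
encoding and "double-width" describes the glyph.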

>> Currently, I treat everything > 255
>> as double-width, which is obviously wrong. Need to read through Unicode
>> documents to get this right. When this is done (and most bugs removed), I'll
>> propose the patch to be merged.
>
>Anything I can help with, or do you have most everything under control ?

Tomohiro KUBOTA <[EMAIL PROTECTED]> pointed me to my_wcwidth() in xterm-utf8.
It uses Markus Kuhn's implementation of the wcwidth() function
( http://www.cl.cam.ac.uk/~mgk25/ucs/wcwidth.c ). So the width problem
will soon be solved.
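
To illustrate what a range-based classification looks like next to the
"everything > 255" rule, here is a deliberately incomplete sketch. It is not
Markus Kuhn's table: approx_width() is a hypothetical name, it knows only
three blocks (CJK Unified Ideographs, Hangul Syllables, Fullwidth Forms), and
it ignores combining and control characters, which a real wcwidth() must
treat specially.

```c
#include <stdint.h>

/* Rough column width of a Unicode code point: 2 for a few well-known
 * double-width blocks, 1 for everything else.
 * Assumption: this small range list is illustrative only and far from
 * complete; zero-width and control characters are not handled. */
static int approx_width(uint32_t cp)
{
    if ((cp >= 0x4E00 && cp <= 0x9FFF) ||   /* CJK Unified Ideographs */
        (cp >= 0xAC00 && cp <= 0xD7A3) ||   /* Hangul Syllables */
        (cp >= 0xFF01 && cp <= 0xFF60))     /* Fullwidth Forms */
        return 2;
    return 1;
}
```

Greek alpha (U+03B1) is above 255 but comes out single-width here, while
U+4E2D comes out double-width. The first case is the one the "> 255" rule
gets wrong.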

Another item on the to-do list is IM (input method) support; I suppose existing
IM code (for MULTICHAR_SET) can be readily re-used.

 -- Xianping
 [EMAIL PROTECTED]