Hi,
Is anyone currently working on extending rxvt for UTF-8? Recently I
did some quick (and very dirty) hacking of the rxvt-2.7.3/ code to get
an idea of how to extend rxvt for UTF-8. Based on this experience,
here's my proposal for rxvt-utf8; if some people have already been
working on this, I'd be happy to join the effort.
- Internally, store each Unicode character as a 16-bit number:
typedef u_int16_t text_t;
(In the future, we can change to u_int32_t to accommodate 32-bit Unicode
characters.)
For each row, we still allocate TermWin.ncol text_t's:
sizeof(text_t) * TermWin.ncol
- Assume that the fonts are mono-spaced (fixed width), i.e., each character
takes 1 column (e.g., ASCII), or 2 columns (e.g., CJK characters), ...
(I'm not sure about proportional fonts.)
For a d-column character (e.g., d=2 for CJK characters),
if its first column is at screen position (row, col), then
screen.text[row][col] is its Unicode value,
and the succeeding (d-1) positions in screen.text[row][]
(screen.text[row][col+1], ... screen.text[row][col+d-1])
are ignored, or cleared to be ' ' or '\0'.
Use an RS_ bit to mark the succeeding (d-1) columns as ``trailing"
columns:
for i = 1..d-1:
screen.rend[row][col+i] |= RS_trailing_column
This scheme does not affect the single-column characters (e.g., ASCII).
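
To make the storage scheme concrete, here is a minimal sketch of writing one d-column character into a row. The helper name put_wide_char, the bit value chosen for RS_trailing_column, and the 32-bit rend_t are assumptions for illustration only; a real patch would use rxvt's own types and pick a free RS_ bit.

```c
#include <stdint.h>

typedef uint16_t text_t;   /* one Unicode character per column */
typedef uint32_t rend_t;   /* rendition bits per column (width assumed) */

/* Hypothetical bit value; a real patch would pick a free RS_ bit. */
#define RS_trailing_column 0x80000000u

/* Store the d-column character ch at column col of one row.
 * Returns the column just past the character, or -1 if it does not fit. */
static int put_wide_char(text_t *text, rend_t *rend, int ncol,
                         int col, text_t ch, int d)
{
    int i;

    if (d < 1 || col < 0 || col + d > ncol)
        return -1;

    text[col] = ch;                      /* first column holds the value */
    rend[col] &= ~RS_trailing_column;

    for (i = 1; i < d; i++) {            /* the remaining (d-1) columns */
        text[col + i] = ' ';
        rend[col + i] |= RS_trailing_column;
    }
    return col + d;
}
```

For d=1 the loop body never runs, so single-column characters are stored exactly as before.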
- Change `tlen' in screen_t to mean:
the length of the line (in columns),
or (-1) * (the length of the line (in columns)), for wrapped lines
The reason for this change is:
when there is only one column at the end of a row, if we write a
d-column character with d > 1, we should write it at the beginning
of the next row. So the length of a wrapped line may be anywhere
between TermWin.ncol and TermWin.ncol-(d-1).
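
A rough sketch of how the signed length could be used when a character does not fit in the remaining columns. The function names here (place_char, line_length, line_is_wrapped) are made up for this example, and a plain int stands in for whatever type the tlen entry really has.

```c
/* Length of a line in columns, regardless of whether it is wrapped. */
static int line_length(int tlen)
{
    return tlen < 0 ? -tlen : tlen;
}

/* Non-zero if the line continues on the next row. */
static int line_is_wrapped(int tlen)
{
    return tlen < 0;
}

/* Decide where a d-column character goes when the cursor is at column
 * col of a row with ncol columns.
 * If the character fits, returns col and sets *wrapped to 0.
 * Otherwise stores the negated line length in *tlen, sets *wrapped to 1
 * (the caller must move to the next row) and returns column 0. */
static int place_char(int ncol, int col, int d, int *tlen, int *wrapped)
{
    if (col + d > ncol) {
        *tlen = -col;   /* wrapped line of length col */
        *wrapped = 1;
        return 0;
    }
    *wrapped = 0;
    return col;
}
```

Note that a wrapped line of length 0 cannot be represented this way; that only happens if d > ncol, which I am assuming never occurs.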
- At the beginning of ``scr_add_lines()", convert the input string from UTF-8
format (unsigned char) to 16-bit integers (text_t).
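
The conversion could look roughly like the sketch below. The name utf8_to_text and the use of U+FFFD for bad input are my assumptions. It only handles 1- to 3-byte sequences (all that fits in 16 bits), does not reject overlong forms, and treats a sequence cut off at the end of the input as malformed; real code would have to carry such a partial sequence over to the next call.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint16_t text_t;

#define REPLACEMENT_CHAR 0xFFFDu   /* emitted for malformed input */

/* Decode up to len bytes of UTF-8 into at most outmax 16-bit characters.
 * Returns the number of characters written to out. */
static size_t utf8_to_text(const unsigned char *in, size_t len,
                           text_t *out, size_t outmax)
{
    size_t i = 0, n = 0;

    while (i < len && n < outmax) {
        unsigned char c = in[i];
        uint32_t cp;
        int extra;                       /* continuation bytes expected */

        if (c < 0x80) {
            cp = c;        extra = 0;
        } else if ((c & 0xE0) == 0xC0) {
            cp = c & 0x1F; extra = 1;
        } else if ((c & 0xF0) == 0xE0) {
            cp = c & 0x0F; extra = 2;
        } else {                         /* stray continuation or 4-byte lead */
            out[n++] = REPLACEMENT_CHAR;
            i++;
            continue;
        }

        i++;
        while (extra > 0 && i < len && (in[i] & 0xC0) == 0x80) {
            cp = (cp << 6) | (in[i] & 0x3Fu);
            i++;
            extra--;
        }
        out[n++] = (extra == 0) ? (text_t)cp : REPLACEMENT_CHAR;
    }
    return n;
}
```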
- In ``selection_make()", convert the selected characters from 16-bit
integers (text_t) to UTF-8 (unsigned char).
- When iterating over the characters of a row, use a loop like this:
for (; col < end_col; col++)
if (!(screen.rend[row][col] & RS_trailing_column))
*str++ = screen.text[row][col];
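
Putting the last two points together, copying a range of columns out as UTF-8 might be sketched as follows. The helper names (text_to_utf8, row_to_utf8) and the RS_trailing_column bit value are assumptions, not existing rxvt code. The caller is assumed to provide room for 3 bytes per column.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint16_t text_t;
typedef uint32_t rend_t;

#define RS_trailing_column 0x80000000u   /* hypothetical bit value */

/* Encode one 16-bit character as UTF-8; returns the byte count (1..3). */
static size_t text_to_utf8(text_t ch, unsigned char *out)
{
    if (ch < 0x80) {
        out[0] = (unsigned char)ch;
        return 1;
    }
    if (ch < 0x800) {
        out[0] = (unsigned char)(0xC0 | (ch >> 6));
        out[1] = (unsigned char)(0x80 | (ch & 0x3F));
        return 2;
    }
    out[0] = (unsigned char)(0xE0 | (ch >> 12));
    out[1] = (unsigned char)(0x80 | ((ch >> 6) & 0x3F));
    out[2] = (unsigned char)(0x80 | (ch & 0x3F));
    return 3;
}

/* Copy columns [col, end_col) of one row to str as UTF-8, skipping the
 * trailing columns of wide characters. Returns the bytes written. */
static size_t row_to_utf8(const text_t *text, const rend_t *rend,
                          int col, int end_col, unsigned char *str)
{
    size_t n = 0;

    for (; col < end_col; col++)
        if (!(rend[col] & RS_trailing_column))
            n += text_to_utf8(text[col], str + n);
    return n;
}
```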
- In ``scr_refresh()", ``buffer" contains the string to be drawn by
draw_string, etc. (here for Unicode,
draw_string = XDrawString16;
draw_image_string = XDrawImageString16;)
Change its type from
static char *buffer = NULL;
to
static text_t *buffer = NULL;
Because XDrawString16 needs XChar2b, on little-endian machines (e.g., x86),
we need to swap the two bytes when copying a 16-bit character to buffer.
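
An alternative to swapping in place would be to fill the two bytes explicitly, which gives the same result on big- and little-endian machines. In the sketch below, char2b_t is a stand-in I define so the example compiles without the X headers; it mirrors the byte1/byte2 layout of XChar2b (byte1 is the high byte). The name fill_char2b is also made up.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint16_t text_t;

/* Stand-in with the same two fields as Xlib's XChar2b. */
typedef struct {
    unsigned char byte1;   /* high byte */
    unsigned char byte2;   /* low byte */
} char2b_t;

/* Copy n 16-bit characters into the 2-byte form used for drawing,
 * independent of the host byte order. */
static void fill_char2b(const text_t *src, char2b_t *dst, size_t n)
{
    size_t i;

    for (i = 0; i < n; i++) {
        dst[i].byte1 = (unsigned char)(src[i] >> 8);
        dst[i].byte2 = (unsigned char)(src[i] & 0xFF);
    }
}
```

With this approach the drawing buffer would hold 2-byte structures rather than text_t values, so the declaration of ``buffer" would change accordingly.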
Problems to be resolved:
- fixed vs. proportional font for Unifont:
I'm using the Unifont (GNU Unicode Font, http://czyborra.com/unifont/)
which contains both 8x16 and 16x16 glyphs.
This mixture of 8x16 and 16x16 glyphs is classified as a proportional
font by rxvt. On the other hand, we cannot force rxvt to see it as
a mono-space (fixed-width) font; otherwise it will use 16x16 cells
for 8x16 glyphs.
So, in ``TermWin_t"
fwidth, /* font width [pixels] */
fheight /* font height [pixels] */
we want ``fwidth" to mean the width of one column.
Currently, I have to force fwidth=8, fheight=16. Perhaps we need
some command line option for this?
- How to quickly classify a Unicode character as single-column or 2-column?
In other words, what characters are double-width? Is it safe to simply
say that the characters in the CJK blocks of Unicode are double-width?
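
As a starting point for discussion, a simple range check might look like the sketch below. The function name char_columns is made up, and the ranges are only a coarse approximation that I am assuming for illustration (the main CJK, Kana, Hangul and fullwidth blocks); the East Asian Width data published with the Unicode standard would be the proper reference.

```c
#include <stdint.h>

typedef uint16_t text_t;

/* Returns 2 for characters in some common East Asian wide blocks,
 * 1 otherwise. Coarse and not exhaustive. */
static int char_columns(text_t ch)
{
    if ((ch >= 0x1100 && ch <= 0x115F) ||   /* Hangul Jamo leading consonants */
        (ch >= 0x2E80 && ch <= 0x9FFF &&    /* CJK radicals, Kana, ideographs */
         ch != 0x303F) ||                   /* U+303F is a narrow fill space */
        (ch >= 0xAC00 && ch <= 0xD7A3) ||   /* Hangul syllables */
        (ch >= 0xF900 && ch <= 0xFAFF) ||   /* CJK compatibility ideographs */
        (ch >= 0xFF01 && ch <= 0xFF60) ||   /* fullwidth forms */
        (ch >= 0xFFE0 && ch <= 0xFFE6))     /* fullwidth signs */
        return 2;
    return 1;
}
```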
Any suggestions would be greatly appreciated.
-- Xianping
[EMAIL PROTECTED]
p.s., Some random thoughts:
- MULTICHAR_SET
The above proposal for Unicode can also work for the MULTICHAR_SET
encodings, as each of these encodings uses a 16-bit code for the CJK
characters.
The existing code for MULTICHAR_SET in rxvt uses space much more
efficiently, as only 8 bits are used for each column, whereas the above
proposal uses 16 bits for each column. Many people wouldn't like the
idea of wasting so much memory :-(
But I guess using 16 bits can greatly simplify the code for
MULTICHAR_SET. And by combining MULTICHAR_SET with Unicode (or
extending MULTICHAR_SET to Unicode, if you like), only a single
code base needs to be maintained for multi-width fonts.
- select-copy-paste Tab:
Using a similar idea, we can use two RS_ bits to denote whether a column
is the first column (or trailing column) of a tab. This way,
we can correctly select-copy-paste tabs.
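
A very rough sketch of the tab idea, with hypothetical names and bit values (RS_tab_first, RS_tab_trailing, put_tab, copy_row). For brevity the copy step handles ASCII only; in practice it would be combined with the UTF-8 conversion in ``selection_make()".

```c
#include <stddef.h>
#include <stdint.h>

typedef uint16_t text_t;
typedef uint32_t rend_t;

/* Hypothetical bit values for the two tab markers. */
#define RS_tab_first    0x40000000u
#define RS_tab_trailing 0x20000000u

/* Record a tab that moves the cursor from col to next_stop.
 * Returns the new cursor column. */
static int put_tab(text_t *text, rend_t *rend, int ncol,
                   int col, int next_stop)
{
    int i;

    if (next_stop > ncol)
        next_stop = ncol;
    if (col < 0 || col >= next_stop)
        return col;

    for (i = col; i < next_stop; i++) {
        text[i] = ' ';
        rend[i] &= ~(RS_tab_first | RS_tab_trailing);
        rend[i] |= (i == col) ? RS_tab_first : RS_tab_trailing;
    }
    return next_stop;
}

/* Copy columns [col, end_col) to str, turning each marked tab back
 * into a single '\t'. ASCII only. Returns the bytes written. */
static size_t copy_row(const text_t *text, const rend_t *rend,
                       int col, int end_col, char *str)
{
    size_t n = 0;

    for (; col < end_col; col++) {
        if (rend[col] & RS_tab_trailing)
            continue;
        str[n++] = (rend[col] & RS_tab_first) ? '\t' : (char)text[col];
    }
    return n;
}
```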