Canonical Mode Input Processing with multi-byte character sets

Markus Kuhn Sun, 22 Feb 2004 16:53:21 -0800

[Crosspost from comp.std.unix about the old cooked mode issue, which as
I understand is still not resolved in the latest Linux kernel, possibly
due to unclear standardization.]


The Canonical Mode Input Processing that takes place on POSIX
systems when the ICANON flag in the c_lflag of struct termios
is set supports a control character ERASE that "erases the last
character in the current line".

http://www.opengroup.org/onlinepubs/007904975/basedefs/xbd_chap11.html

In the case of multi-byte character sets, such as the UTF-8 encoding
widely used in recent GNU/Linux distributions, this means that the
function that implements the ERASE action must know what the
encoding is, to be able to remove the correct number of bytes per
character.
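
[For UTF-8 the byte-level rule is at least simple to state: drop the
final byte, then keep dropping bytes while the byte now at the end of
the buffer is a continuation byte (bit pattern 10xxxxxx). The following
is only a minimal sketch in C, assuming the line buffer holds
well-formed UTF-8. The function name erase_last_utf8 is hypothetical
and not taken from any existing terminal driver.]

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: return the new length of a line buffer after
 * removing its last UTF-8 encoded character.  Assumes well-formed
 * UTF-8; on malformed input it removes at most four bytes.
 */
size_t erase_last_utf8(const unsigned char *buf, size_t len)
{
    int steps = 0;

    assert(buf != NULL || len == 0);
    if (len == 0)
        return 0;               /* nothing to erase */

    len--;                      /* drop the final byte */

    /* If the dropped byte was a continuation byte (10xxxxxx), keep
     * stepping back towards the lead byte, at most three more times. */
    while (len > 0 && steps < 3 && (buf[len] & 0xC0) == 0x80) {
        len--;
        steps++;
    }
    return len;
}
```

[With a single-byte encoding the loop never runs, so the same routine
degrades to removing one byte. EUC and other multi-byte encodings need
a different rule, which is why the driver has to be told the encoding.]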

How exactly do the authors of the POSIX standard envision that the
implementation of the ERASE action learns about the character
encoding that it deals with?

The termios structure, which controls every other aspect of the
Canonical Mode Input Processing in great detail, lacks any flags
that distinguish between different multi-byte encodings. POSIX does not
give any suggestion that tcsetattr() examines the LC_CTYPE locale,
either.

One could imagine at least two ways of communicating the current encoding
in a portable way to the place where the Canonical Mode Input Processing
takes place:

  (a) Add a few bits to the c_lflag of the termios structure that
      distinguish between families of encodings such as

         TCCHARSET    bitmask for the charset field

           TCCSINGLE  single-byte encoding (e.g., any part of ISO 8859)
           TCCUTF8    UTF-8 (ISO 10646)
           TCCEUC     EUC (AT&T's Extended Unix Code)
           ...

      This granularity of identifying the character set conveys just
      enough information to identify character boundaries in the
      input byte buffer.

  (b) Extend the definition of tcsetattr() such that it becomes
      dependent on the LC_CTYPE locale and in particular communicates
      the current value of nl_langinfo(CODESET) to the terminal driver.

Either option would enable the implementation of the ERASE control
function to remove the correct number of bytes from the input buffer.
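
[To illustrate how the two options relate, here is a rough sketch in C
of user-space code that reduces nl_langinfo(CODESET) to one of the
encoding families named in (a). The TCC* names come from the proposal
above and exist in no current termios; their numeric values and the
function names charset_family and current_charset_family are
hypothetical. Treating every unrecognized codeset as single-byte is a
simplification that would be wrong for other multi-byte encodings.]

```c
#include <assert.h>
#include <langinfo.h>
#include <locale.h>
#include <string.h>

/* Hypothetical encoding families from proposal (a); values are made up. */
enum tc_charset { TCCSINGLE = 0, TCCUTF8 = 1, TCCEUC = 2 };

/* Map a codeset name, as returned by nl_langinfo(CODESET), to a family. */
enum tc_charset charset_family(const char *codeset)
{
    assert(codeset != NULL);
    if (strcmp(codeset, "UTF-8") == 0)
        return TCCUTF8;
    if (strncmp(codeset, "EUC", 3) == 0)    /* EUC-JP, EUC-KR, ... */
        return TCCEUC;
    return TCCSINGLE;                       /* simplification, see above */
}

/* Determine the family for the LC_CTYPE locale of the environment. */
enum tc_charset current_charset_family(void)
{
    setlocale(LC_CTYPE, "");
    return charset_family(nl_langinfo(CODESET));
}
```

[Under option (a) an application would store such a value in the
termios structure itself; under option (b) tcsetattr() would perform an
equivalent lookup on the caller's behalf.]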

Do the authors of the POSIX standard have any view on which of these
two options is more in the spirit of the standard?

Or is this recognized as an open question in need of an amendment
to the standard?


A second, closely related problem is the question of how to implement
the function

  "If ECHOE and ICANON are set, the ERASE character shall cause
  the terminal to erase, if possible, the last character in the
  current line from the display."

Current practice appears to be to simply transmit the sequence
BACKSPACE SPACE BACKSPACE to the terminal. However, BACKSPACE
typically repositions the cursor only one cell to the left and
SPACE overwrites only a single cell. But UTF-8 and EUC terminals
typically output CJK ideographs using double-width glyphs that
are twice as wide as a SPACE (two "cells" instead of one). The
ISO 6429 standard that is the basis for the widely used VT100
terminal semantics does not seem to foresee the use of
double-width glyphs.

It follows that these terminals must be modified such that
BACKSPACE moves the cursor one character instead of one cell
to the left, and that a space character that overwrites half of
a double-width glyph must erase the full double-width glyph.

Any other solution would imply that the implementation of the
ECHOE function has access to the wcwidth() function, to be able to
output BACKSPACE BACKSPACE SPACE SPACE BACKSPACE BACKSPACE when
ERASE removes a wcwidth()=2 character. As the Canonical Mode Input
Processing is traditionally implemented in a terminal device driver
in the kernel, which is otherwise free of locale-dependent
functionality and has no access to wcwidth(), this approach seems
highly undesirable.
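
[For concreteness, the width-aware echo described above could look
roughly like the following sketch in C. It assumes the caller has
already obtained the column width of the erased character, for example
from wcwidth() in user space. The function name echo_erase_sequence is
hypothetical.]

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: build the echo sequence that erases a character
 * occupying `width` cells: `width` backspaces, `width` spaces, then
 * `width` backspaces.  Writes a NUL-terminated string to `out` and
 * returns its length, or 0 if width <= 0 or the buffer is too small.
 */
size_t echo_erase_sequence(int width, char *out, size_t outsize)
{
    size_t w, need;

    assert(out != NULL);
    if (width <= 0)
        return 0;               /* e.g. combining characters: no cells */

    w = (size_t)width;
    need = 3 * w;
    if (outsize < need + 1)
        return 0;               /* not enough room for sequence + NUL */

    memset(out, '\b', w);           /* move left over the glyph */
    memset(out + w, ' ', w);        /* blank every cell it occupied */
    memset(out + 2 * w, '\b', w);   /* move back to its start */
    out[need] = '\0';
    return need;
}
```

[A width of 1 yields the familiar BACKSPACE SPACE BACKSPACE, and a
width of 2 yields the six-byte sequence given above.]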

[Things get even more tricky with the available experimental terminal
support (e.g., in XFree86's xterm) for combining characters such as
diacritical marks, which are characters with wcwidth()=0. It
is not yet common practice to put combining characters onto keyboard
mappings, but they might in principle find their way into a Canonical
Mode Input Processing buffer via cut&paste facilities.]

Markus

-- 
Markus Kuhn, Computer Laboratory, University of Cambridge
http://www.cl.cam.ac.uk/~mgk25/ || CB3 0FD, Great Britain


--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
