[Crosspost from comp.std.unix about the old cooked-mode issue, which, as I understand it, is still not resolved in the latest Linux kernel, possibly due to unclear standardization.]

The Canonical Mode Input Processing that takes place on POSIX systems when the ICANON flag in the c_lflag of struct termios is set supports a control character ERASE that "erases the last character in the current line".

  http://www.opengroup.org/onlinepubs/007904975/basedefs/xbd_chap11.html

In the case of multi-byte character sets, such as the UTF-8 encoding widely used in recent GNU/Linux distributions, this means that the function that implements the ERASE action must know what the encoding is, to be able to remove the correct number of bytes per character.

How exactly do the authors of the POSIX standard envision that the implementation of the ERASE action learns about the character encoding it is dealing with? The termios structure, which controls every other aspect of the Canonical Mode Input Processing in great detail, lacks any flags that distinguish between different multi-byte encodings. Nor does POSIX give any suggestion that tcsetattr() examines the LC_CTYPE locale.

One could imagine at least two ways of communicating the current encoding in a portable way to the place where the Canonical Mode Input Processing takes place:

  (a) Add a few bits to the c_lflag of the termios structure that distinguish between families of encodings, such as

        TCCHARSET   bitmask for the charset field
        TCCSINGLE   single-byte encoding (e.g., any part of ISO 8859)
        TCCUTF8     UTF-8 (ISO 10646)
        TCCEUC      EUC (AT&T's Extended Unix Code)
        ...

      This granularity of identifying the character set conveys just enough information to identify character boundaries in the input byte buffer.

  (b) Extend the definition of tcsetattr() such that it becomes dependent on the LC_CTYPE locale and in particular communicates the current value of nl_langinfo(CODESET) to the terminal driver.

Either option would enable the implementation of the ERASE control function to remove the correct number of bytes from the input buffer.

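To make the byte-counting problem concrete, here is a rough user-space sketch of what ERASE would have to compute once it knows the encoding family. All names in it (erase_len, enum tc_charset, TC_SINGLE, TC_UTF8) are made up for illustration and only loosely mirror the TCC* values suggested in (a); nothing like this exists in POSIX or in any termios header I know of. EUC is left out, and malformed UTF-8 is handled only by capping the step-back at four bytes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical encoding families, loosely mirroring the TCC* values
 * proposed in option (a). Not part of any real termios interface. */
enum tc_charset { TC_SINGLE, TC_UTF8 };

/* Return how many bytes ERASE should drop from the end of the current
 * line buffer so that exactly one character is removed. */
static size_t erase_len(const unsigned char *line, size_t len,
                        enum tc_charset cs)
{
    if (len == 0)
        return 0;               /* nothing to erase */
    assert(line != NULL);
    if (cs != TC_UTF8)
        return 1;               /* single-byte encoding: one byte per char */

    /* UTF-8: step back over continuation bytes (10xxxxxx) until a lead
     * byte or ASCII byte is reached. A character is at most 4 bytes. */
    size_t n = 1;
    while (n < len && n < 4 && (line[len - n] & 0xC0) == 0x80)
        n++;
    return n;
}
```

With the buffer holding "a" followed by U+00E9 (bytes 0x61 0xC3 0xA9), the UTF-8 branch reports two bytes to remove, whereas a driver that assumes a single-byte encoding removes only 0xA9 and leaves a dangling lead byte in the line.
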
Do the authors of the POSIX standard have any view on which of these two options is more in the spirit of the standard? Or is this recognized as an open question in need of an amendment to the standard?

A second, closely related problem is the question of how to implement the function

  "If ECHOE and ICANON are set, the ERASE character shall cause the terminal to erase, if possible, the last character in the current line from the display."

Current practice appears to be to simply transmit the sequence BACKSPACE SPACE BACKSPACE to the terminal. However, BACKSPACE typically moves the cursor only one cell to the left, and SPACE overwrites only a single cell. But UTF-8 and EUC terminals typically output CJK ideographs using double-width glyphs that are twice as wide as a SPACE (two "cells" instead of one). The ISO 6429 standard that is the basis for the widely used VT100 terminal semantics does not seem to foresee the use of double-width glyphs. It follows that these terminals must be modified such that BACKSPACE moves the cursor one character instead of one cell to the left, and that a space character that overwrites half of a double-width glyph must erase the full double-width glyph.

Any other solution would imply that the implementation of the ECHOE function has access to the wcwidth() function, to be able to output BACKSPACE BACKSPACE SPACE SPACE BACKSPACE BACKSPACE when ERASE removes a wcwidth()=2 character. As the Canonical Mode Input Processing is traditionally implemented in a terminal device driver in the kernel, which is otherwise free of locale-dependent functionality and has no access to wcwidth(), this approach seems highly undesirable.

[Things get even more tricky with the available experimental terminal support (e.g., in XFree86's xterm) for combining characters such as diacritical marks, which are characters with wcwidth()=0.
It is not yet common practice to put combining characters onto keyboard mappings, but they might in principle find their way into a Canonical Mode Input Processing buffer via cut&paste facilities.]

Markus

--
Markus Kuhn, Computer Laboratory, University of Cambridge
http://www.cl.cam.ac.uk/~mgk25/ || CB3 0FD, Great Britain

--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
