Re: ksh(1): don't output invalid UTF-8 characters

Nicholas Marriott Fri, 19 May 2017 11:05:21 -0700

Hi

Perhaps I haven't understood what you are saying correctly, but I don't
think it is possible to send control characters or any other invalid
UTF-8 bytes inside UTF-8 characters and safely predict what the terminal
will do. How about these examples:


printf '\343\203\010\217a\n'
printf '\343\203\r\217b\n'
printf '\343\203\n\217c\n'

Or with a width 1 character:

printf '\342\211\010\240\n'
printf '\342\211\r\240b\n'
printf '\342\211\n\240c\n'

I have four UTF-8 capable X terminals here and between them they do
three different things (xterm, rxvt-unicode, gnome-terminal & konsole).

tmux can't forward this stuff to the terminal outside, because if it
can't predict what that terminal will do, tmux's idea of the display
will get out of sync with reality.

Having tmux ignore the whole lot seems like a relatively sensible
course.

The only other alternative would be to substitute U+FFFD. But that seems
iffy too - U+FFFD is width 1, but what if the application is expecting
a width 2?




On Fri, May 19, 2017 at 06:54:38PM +0200, Ingo Schwarze wrote:
> Hi Anton,
> 
> Anton Lindqvist wrote on Fri, May 19, 2017 at 08:42:05AM +0200:
> 
> > 1. Run ksh under tmux.
> > 
> > 2. Input the following characters, without spaces:
> > 
> >    a (any character) ^B (backward-char) ?? (any UTF-8 character)
> > 
> > 3. At this point, the prompt gets overwritten.
> > 
> > Since ksh read a single byte of input, it will display a partial UTF-8
> > character before the whole character has been read. This is especially
> > troublesome when the cursor is not placed at the end of the line. In the
> > scenario above, after reading the first byte of '??' the following
> > sequence will be displayed:
> > 
> >   0xc3 0x61 0x08
> > 
> > That is the first byte of '??' (0xc3), 'a' (0x61), '\b' (0x08). tmux
> > does the right thing here, since 0xc3 is a valid UTF-8 start byte it
> > expects it to be followed by a UTF-8 continuation byte which is not the
> > case. The two first bytes (0xc3, 0x61) are discarded and the parser is
> > reset to its initial state
> 
> I call that a bug in tmux.  At least for UTF-8, tmux should never reset
> its parser.  Depending on the keyboard configuration, it may be possible
> to enter single non-UTF-8 bytes.  For example, at the console, i can do
> this:
> 
>  - type "printf x"
>  - press Ctrl-V, Ctrl-Alt-Z
>  - type "printf x | hexdump -C"
> 
> The result is:
> 
>   00000000  78 9a 78   |x.x|
>   00000003
> 
> All of the shell, the console, and xterm more or less handle such
> stunts, even though admittedly, that is not the usual way of typing
> in a binary file.  tmux is a terminal emulator, so it ought to cope,
> too, and not reset its state.
> 
> Note that this is yet another instance where the concept of arbitrary
> locales is utterly broken and insecure.  On some non-OpenBSD system
> supporting such insecure stuff, if the user has set an arbitrary,
> non-UTF-8, state-dependent locale and somehow manages to insert an
> invalid byte into the input stream, the state of the input stream
> becomes invalid, no further characters can be read from that terminal,
> and *there is no way to recover*.  The only remaining half-secure
> option is for the shell to exit, which may also not be very secure,
> on a different level.
> 
> But with UTF-8, there is no problem whatsoever dealing with invalid
> bytes, so tmux ought to cope.
> 
> > causing the backspace to be accepted and the
> > first character in the prompt to be overwritten.
> 
> That's part of the reason why tmux must not reset its state:
> The shell won't know about it, and the screen will become garbled.
> 
> > Below is diff that make sure to read a whole UTF-8 character in
> > x_emacs() prior doing another iteration of the main-loop
> 
> I don't think we can do that.  What if there is no next byte?
> Then the shell will hang until, maybe considerably later, the next
> character is typed.  Also, we cannot rely on parsing the UTF-8 start
> byte (even if done correctly).  It tells the byte length of the
> character only for valid byte sequences, and the byte sequence need
> not be valid.
> 
> Besides, i think patching the shell to work around terminal bugs
> (or terminal emulator bugs) is the wrong approach in the first place.
> 
> Yours,
>   Ingo
>

Re: ksh(1): don't output invalid UTF-8 characters

Reply via email to