Re: ksh(1): don't output invalid UTF-8 characters

Anton Lindqvist Fri, 19 May 2017 06:18:37 -0700

On Fri, May 19, 2017 at 09:33:33AM -0300, Lucas Gabriel Vuotto wrote:
> Hi,
> 
> On 19/05/17 03:42, Anton Lindqvist wrote:
> > Hi,
> > I submitted this problem[1] earlier, but with an incomplete analysis
> > and fix. Here's a second attempt.
> > 
> > This only occurs when running ksh in emacs mode under tmux. How to
> > reproduce:
> > 
> > 1. Run ksh under tmux.
> > 
> > 2. Input the following characters, without spaces:
> > 
> >    a (any character) ^B (backward-char) ö (any UTF-8 character)
> > 
> > 3. At this point, the prompt gets overwritten.
> > 
> > Since ksh reads input one byte at a time, it will display a partial
> > UTF-8 character before the whole character has been read. This is
> > especially troublesome when the cursor is not placed at the end of the
> > line. In the scenario above, after reading the first byte of 'ö' the
> > following sequence will be displayed:
> > 
> >   0xc3 0x61 0x08
> > 
> > That is the first byte of 'ö' (0xc3), 'a' (0x61), '\b' (0x08). tmux
> > does the right thing here: since 0xc3 is a valid UTF-8 start byte, it
> > expects a UTF-8 continuation byte to follow, which is not the case.
> > The first two bytes (0xc3, 0x61) are discarded and the parser is reset
> > to its initial state, causing the backspace to be accepted and the
> > first character in the prompt to be overwritten.
> > 
> > After the second byte of 'ö' (0xb6) is read by ksh, the following
> > sequence will be displayed:
> > 
> >    0x08 0xc3 0xb6 0x61 0x08
> > 
> > That is '\b' (0x08), 'ö' (0xc3, 0xb6), 'a' (0x61), '\b' (0x08). Since
> > ksh assumes the cursor is correctly positioned, it displays a leading
> > backspace in order to move past the first character. That assumption
> > does not hold, causing another character in the prompt to be
> > overwritten.
> > 
> > Below is a diff that makes sure to read a whole UTF-8 character in
> > x_emacs() prior to doing another iteration of the main loop, which
> > solves the problem. It does not validate UTF-8 input but instead
> > assumes every such character is valid.
> > 
> > Comments and feedback are much appreciated.
> > 
> > [1] http://marc.info/?l=openbsd-misc&m=148509346310901&w=2
> > 
> > Index: emacs.c
> > ===================================================================
> > RCS file: /cvs/src/bin/ksh/emacs.c,v
> > retrieving revision 1.67
> > diff -u -p -r1.67 emacs.c
> > --- emacs.c 12 May 2017 14:37:52 -0000      1.67
> > +++ emacs.c 14 May 2017 08:21:26 -0000
> > @@ -98,6 +98,7 @@ static int        x_col;
> >  static int x_displen;
> >  static int x_arg;          /* general purpose arg */
> >  static int x_arg_defaulted;/* x_arg not explicitly set; defaulted to 1 */
> > +static int x_getc_again;
> >  
> >  static int xlp_valid;
> >  /* end from 4.9 edit.h } */
> > @@ -142,6 +143,7 @@ static int      x_fold_case(int);
> >  static char        *x_lastcp(void);
> >  static void        do_complete(int, Comp_type);
> >  static int isu8cont(unsigned char);
> > +static int u8len(unsigned char);
> >  
> >  /* proto's for keybindings */
> >  static int x_abort(int);
> > @@ -272,6 +274,21 @@ isu8cont(unsigned char c)
> >     return (c & (0x80 | 0x40)) == 0x80;
> >  }
> >  
> > +static int
> > +u8len(unsigned char c)
> > +{
> > +   switch (c & 0xF0) {
> > +   case 0xF0:
> > +           return 4;
> > +   case 0xE0:
> > +           return 3;
> > +   case 0xC0:
> > +           return 2;
> > +   default:
> > +           return 1;
> > +   }
> > +}
> > +
> 
> This is wrong: most codepoints in the range U+0080-U+07ff (the ones at
> U+0400 and above, whose start byte is 0xd0-0xdf) would be interpreted as
> being 1 byte long instead of 2.
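
To make the failure concrete: masking with 0xF0 maps the start bytes
0xD0-0xDF to 0xD0, which matches none of the cases, so they fall through
to the default. A minimal sketch of a length function that tests the
high bits instead could look like the following. The name u8len_fixed
is made up for illustration and is not part of the diff, and like the
original it does not validate anything (0xC0, 0xC1 and 0xF5-0xF7 are
still reported as start bytes).

```c
/*
 * Hypothetical replacement for u8len(): classify a byte by its
 * leading bit pattern rather than by its high nibble.
 * Returns 1 for ASCII and for any byte that does not look like a
 * multi-byte start byte (continuation bytes, 0xF8 and above).
 */
static int
u8len_fixed(unsigned char c)
{
	if ((c & 0xE0) == 0xC0)		/* 110xxxxx: 0xC0-0xDF */
		return 2;
	if ((c & 0xF0) == 0xE0)		/* 1110xxxx: 0xE0-0xEF */
		return 3;
	if ((c & 0xF8) == 0xF0)		/* 11110xxx: 0xF0-0xF7 */
		return 4;
	return 1;
}
```

With this, u8len_fixed(0xD0) yields 2, where the switch on (c & 0xF0)
yields 1.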


Thanks for the heads-up. Maybe a more reliable solution would be to call
mbtowc(3) repeatedly as new input arrives until it returns successfully,
assuming the first byte read is a UTF-8 start byte.
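
A rough sketch of that idea, not tested against ksh itself. It uses
mbrtowc(3) rather than mbtowc(3), since the restartable variant returns
(size_t)-2 for an incomplete sequence and (size_t)-1 for an invalid
one, which mbtowc(3) cannot tell apart. The struct and function names
are invented for illustration, and it assumes LC_CTYPE has already been
set to a UTF-8 locale with setlocale(3).

```c
#include <limits.h>
#include <locale.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical accumulator for one multi-byte character; not ksh code. */
struct u8acc {
	mbstate_t	st;			/* conversion state across calls */
	char		buf[MB_LEN_MAX];	/* bytes collected so far */
	size_t		len;
};

enum u8status { U8_MORE, U8_DONE, U8_BAD };

/* Start over; also needed after U8_DONE and U8_BAD. */
static void
u8acc_reset(struct u8acc *a)
{
	memset(a, 0, sizeof(*a));
}

/*
 * Feed one input byte. U8_MORE means keep reading before displaying
 * anything, U8_DONE means buf[0..len) holds a complete character,
 * U8_BAD means the bytes seen so far are not valid in this locale.
 */
static enum u8status
u8acc_feed(struct u8acc *a, unsigned char c)
{
	char	ch = (char)c;
	size_t	r;

	if (a->len == sizeof(a->buf))
		return U8_BAD;
	a->buf[a->len++] = ch;

	r = mbrtowc(NULL, &ch, 1, &a->st);
	if (r == (size_t)-2)
		return U8_MORE;
	if (r == (size_t)-1)
		return U8_BAD;
	return U8_DONE;
}
```

In the scenario from the original report, feeding 0xc3 would give
U8_MORE, so nothing is written to the terminal until 0xb6 arrives and
the call gives U8_DONE. What to do on U8_BAD (drop the bytes, or pass
them through as-is) is left open here.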
