Hi Nicholas,

Nicholas Marriott wrote on Fri, May 19, 2017 at 07:04:53PM +0100:

> Perhaps I haven't understood what you are saying correctly,

What matters most is that sending an incomplete character
followed by U+0008 (ASCII BACKSPACE) is a no-op, both in the sense
that it doesn't change the line being edited and that it doesn't
change the display.  All terminals you mentioned seem to conform
to that according to my testing, except tmux.

> but I don't think it is possible to send control characters or
> any other invalid UTF-8 bytes inside UTF-8 characters and safely
> predict what the terminal will do. How about these examples:

If the invalid bytes are still present by the time the line is sent
off for processing (like in your example printf '\343\203\n'), then
it is indeed hard to predict what random terminals will do, though
i would argue that xterm's behaviour is correct (print one substitution
glyph for each incomplete character, or if bytes don't form even
incomplete characters, then one for each such invalid byte).

urxvt is clearly broken:

 $ printf '\343\203x\n'

prints U+00E3 x linefeed; i have no idea what it does to garble
0xe383 into 0xc3a3.  Maybe some naive misparsing, or spewing out
incomplete parsing state in some inconsistent way.
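
One arithmetic observation, offered as a guess and not as a description
of urxvt's actual code: 0xc3a3 is exactly what results when the lead byte
0xe3 alone is read as a single-byte (Latin-1) character and re-encoded
as UTF-8.  The variable names in this sketch are made up for illustration.

```python
# Hypothetical illustration only; this is not taken from urxvt's source.
lead = bytes([0xE3])                    # first byte of the incomplete sequence
as_latin1 = lead.decode("latin-1")      # the byte read as a one-byte character
reencoded = as_latin1.encode("utf-8")   # that character encoded as UTF-8

# Show the resulting code point and its UTF-8 bytes in hex.
print(f"U+{ord(as_latin1):04X}", reencoded.hex())
```

That would at least be consistent with the naive misparsing theory; it
says nothing about where the continuation byte 0x83 goes.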

gnome-terminal and konsole print one replacement character for each
invalid byte, even if bytes form an incomplete character.  Maybe
not outright wrong, but arguably a bit confusing.
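
To make the two replacement policies concrete, here is a small Python
sketch.  It models neither terminal's real code.  It assumes CPython 3's
UTF-8 decoder, whose "replace" handler emits one U+FFFD per incomplete
sequence; the handler name "replace-per-byte" and both function names
are invented for this example.

```python
import codecs

REPLACEMENT = "\ufffd"

def _one_per_byte(exc):
    # exc.start..exc.end spans the bytes the decoder rejected as one unit;
    # emit one replacement character for every byte in that span.
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    return REPLACEMENT * (exc.end - exc.start), exc.end

# Hypothetical handler name, registered only for this sketch.
codecs.register_error("replace-per-byte", _one_per_byte)

def per_sequence(data: bytes) -> str:
    """One U+FFFD per incomplete character (the xterm-like policy)."""
    return data.decode("utf-8", errors="replace")

def per_byte(data: bytes) -> str:
    """One U+FFFD per invalid byte (the gnome-terminal/konsole-like policy)."""
    return data.decode("utf-8", errors="replace-per-byte")

sample = b"\xe3\x83x\n"
print([f"U+{ord(c):04X}" for c in per_sequence(sample)])
print([f"U+{ord(c):04X}" for c in per_byte(sample)])
```

For the sample bytes, the first policy yields a single U+FFFD before
the x and the second yields two, so the same line occupies a different
number of columns depending on the terminal.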

So yeah, if lines containing incomplete sequences *when they are
sent off* misformat with tmux and gnome-terminal or konsole, i
wouldn't call that tmux's fault, and i agree there is little that
can be done about it.

> Having tmux ignore the whole lot seems like a relatively sensible
> course.

Well, what tmux currently does is make sure that everything gets
broken in the maximum possible way on every terminal, even if the
line that is finally sent off is completely correct.

> The only other alternative would be to substitute U+FFFD.

Why not just pass the bytes through?  I don't think it's the job
of a terminal multiplexer to mess with individual bytes.  It's the
job of the final terminal doing the display to select glyphs and
place them, for printable characters, for non-printable characters,
for incomplete characters, and for invalid bytes.

> But that seems iffy too - U+FFFD is width 1, but what if the
> application is expecting a width 2?

By definition, incomplete characters and invalid bytes don't
have a width, so it doesn't matter what the application wanted
(for example, which character the user intended to type but
didn't finish).  What matters is how wide the replacement
glyph will look on the final terminal.  In that respect, we
cannot help making an assumption, and "incomplete sequences
and invalid bytes are displayed as U+FFFD (i.e. width 1)"
seems about the best we can do.  That may be slightly off
for gnome-terminal and konsole, but i don't see how that can
be helped.
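
A minimal sketch of that width assumption in Python, for illustration
only: the function name assumed_width is hypothetical, the code is not
from tmux, and it relies on CPython 3 replacing each incomplete sequence
with a single U+FFFD.  It also ignores control characters.

```python
import unicodedata

def assumed_width(data: bytes) -> int:
    """Columns a byte string is assumed to occupy on the final terminal."""
    text = data.decode("utf-8", errors="replace")
    width = 0
    for ch in text:
        if ch == "\ufffd":
            # Assumption: a replacement glyph takes one column.
            width += 1
        elif unicodedata.combining(ch):
            # Combining marks attach to the previous character.
            continue
        elif unicodedata.east_asian_width(ch) in ("W", "F"):
            # Wide and fullwidth characters take two columns.
            width += 2
        else:
            width += 1
    return width

print(assumed_width(b"\xe3\x83\x86"))  # complete three-byte character U+30C6
print(assumed_width(b"\xe3\x83"))      # the same character, left incomplete
```

The complete character counts as two columns and the incomplete one as
one column, regardless of what the user intended to type.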

Anyway, these subtleties of invalid bytes that *remain* are not
the main inconvenience in practice.  What matters more is that
tmux breaks even if incomplete characters are deleted again
with backspaces and never sent off.

Yours,
  Ingo
