Re: ksh(1): don't output invalid UTF-8 characters

Walter Alejandro Iglesias Mon, 05 Jun 2017 12:21:57 -0700

On Mon, Jun 05, 2017 at 06:06:34PM +0200, Ingo Schwarze wrote:
> Hi Walter,
> 
> Walter Alejandro Iglesias wrote on Mon, Jun 05, 2017 at 04:50:21PM +0200:
> 
> > report (I'm on chapter 2 of K&R :-)).  I wish with time I'll learn how
> > to do it.
> 
> IIRC, you said you saw some undesirable behaviour with ksh input.
> 
> I assume you have a sequence of key presses on your keyboard that
> demonstrate the undesirable behaviour.  To capture the sequence,
>


I will *study* all the instructions you gave me.  But this time
I don't think you need a capture of the sequence.  Just use *any*
Latin-1 character whose hex value is smaller than \xc0.

To make the test easier for you: in xterm, after running "setxkbmap de",

  AltGr + Shift + 1

prints the inverted exclamation mark (\xa1) that we also use in Spanish.
In the console or a C-locale xterm, type it mixed in among random ASCII
characters, then move the cursor from the first to the last column,
passing over that character.  Assuming you're running -current, see
what happens.
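
A minimal sketch, in Python, of why such a byte is awkward for anything
that expects UTF-8: every byte from \x80 to \xbf has the bit pattern
10xxxxxx, which UTF-8 reserves for continuation bytes, so a lone \xa1 is
not a valid sequence.  The name is_continuation_byte and the sample line
are made up for illustration; this says nothing about how ksh itself
handles the byte.

```python
def is_continuation_byte(b: int) -> bool:
    """Return True if b has the bit pattern 10xxxxxx (0x80..0xbf)."""
    return (b & 0xC0) == 0x80


# Hypothetical input line: the Latin-1 byte \xa1 among ASCII characters.
line = b"ab\xa1cd"

# A strict UTF-8 decoder rejects the line, because \xa1 appears without
# a preceding start byte.
try:
    line.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False

# Read as Latin-1, the same byte is the inverted exclamation mark.
as_latin1 = line.decode("latin-1")
```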


Anyway, to be honest, these bugs don't hurt; you can live with them.
What I'm trying to say with these reports is that I'm not truly convinced
utf8 support in the console is a good idea.

Another test you can do, this time in a utf-8 xterm: if you activate the
bell and move the cursor to the end of the line, it'll beep.  Now type
some utf-8 character at the end and do the same: it won't beep, because
the cursor is on the first byte of the utf-8 character; *it can't reach
the real end of the line*.  Nobody will die because of this issue or the
one above.  My point is utf8 will always be a mess.  KEN, DO YOU HEAR
ME?  IT WAS YOUR OWN CHILD, KEN! :-)

I wonder how plan9 handles utf8.


[...]

>
> 
> For testing, go to the regress directory:
> 
>    $ cd /usr/src/regress/bin/ksh
>    $ cvs up -dP
>    $ cd edit
>    $ make obj
>    $ make cleandir
>    $ make regress
>    $ ./obj/edit < input.txt | hexdump -C
>   00000000  24 20 78 79 08 c3 a9 79  08 0a   |$ xy...y..|
>   0000000a


I've been wondering how to work with this.  Thanks!


[...]

> > By the way, something in the last paragraph of the new utf8(7) man page
> > isn't clear enough (I mentioned this to tedu@).
> 
> Which paragraph exactly, and what is unclear?  Maybe we can fix it
> quickly.

As I told you, the _last_ one:

   Encodings using more bytes than required are invalid.  In particular,
   11000000 and 11000001 are not valid start bytes, the byte after
   11100000 must be at least 10100000, and the byte after 11110000 must
   be at least 10010000.

I don't understand the 'at least' assumptions.  Some examples in which
the byte after 1110.... is *smaller* than 1010....:

    Euro sign:
    11100010 10000010 10101100

    Em dash:
    11100010 10000000 10010100

    Double quotes:
    11100010 10000000 10011100
    11100010 10000000 10011101

You can find examples where the byte after 1110.... is *greater* than
1010.... here:

http://www.utf8-chartable.de/
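
One possible reading of the man page paragraph, sketched in Python: it
assumes that "the byte after 11100000" means a start byte that is exactly
11100000, not any 1110.... byte.  Under that assumption the examples
above, which all start with 11100010, are not affected by the rule.  The
helper names bits and overlong_by_quoted_rule and the third byte sequence
are made up for illustration.

```python
def bits(s: str) -> bytes:
    """Turn a string like '11100010 10000010' into the bytes it spells."""
    return bytes(int(group, 2) for group in s.split())


def overlong_by_quoted_rule(seq: bytes) -> bool:
    """Apply the quoted paragraph to the first two bytes of seq.

    Assumption: each "byte after" clause applies only when the start
    byte is exactly the one named in the paragraph.
    """
    first = seq[0]
    second = seq[1] if len(seq) > 1 else None
    if first in (0b11000000, 0b11000001):
        return True
    if second is None:
        return False
    if first == 0b11100000 and second < 0b10100000:
        return True
    if first == 0b11110000 and second < 0b10010000:
        return True
    return False


euro = bits("11100010 10000010 10101100")      # from the examples above
em_dash = bits("11100010 10000000 10010100")   # from the examples above
# Hypothetical sequence: same trailing bytes as the Euro sign, but with
# the start byte 11100000 that the paragraph names.
candidate = bits("11100000 10000010 10101100")

for name, seq in (("euro", euro), ("em_dash", em_dash),
                  ("candidate", candidate)):
    print(name, seq.hex(), overlong_by_quoted_rule(seq))
```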



Thank you for your advice.  I'll study your whole message carefully.


> 
> Yours,
>   Ingo
