On Fri, Apr 11, 2025 at 08:28:38PM +0200, Ingo Schwarze wrote:
> Hi Walter,
> 
> On Sat, Apr 05, 2025 at 01:33:10PM +0200, Walter Alejandro Iglesias wrote:
> 
> > I discovered another issue (in the unpatched ksh).
> 
> Since i hate posting UTF-8 to mailing lists, let me paraphrase your
> finding such that i do not need to quote your (excessively verbose)
> example.
> 
>   Put ksh(1) into VI command line editing mode.
>   Type a line containing any UTF-8 character (just a two-byte line
>     consisting of a single character in enough).
>   Put anything into the yank buffer (even a single ASCII character is enough).
>   Put the cursor on any UTF-8 character (technically, that means the
>     code implementating the shell has the cursor sitting on the first
>     byte of that UTF-8 character, but that is an implementation detail
>     not visible to the user).
>   Issue the paste (p) command.
> 
> The result is that the content of the yank buffer is inserted into the
> middle of the UTF-8 character.  Fur example, if you use the UTF-8
> character 0xc3 0xa9 (e-accent aigu) and the paste buffer 0x61 (letter a),
> then you get 0xc3 0x61 0xa9, which is not a valid UTF-8 string.
> 
> Consequently, i do confirm that you found a bug.

Your long and detailed explanations reflect your methodical thinking,
which is undoubtedly less prone to error, but this has its downside:
methodical and logical thinking isn't always effective in uncovering
errors and finding solutions to problems, at least not in a first stage.
In every creative endeavor, there's a chaotic first phase, a famous
Argentine writer, Jorge Luis Borges, quoted advice from his father, who
was also a writer: "First, unleash your imagination.  Then, the axe."
Also, we have to take into account how lazy people are to read these
days, for example, from what you tell me below, I suspect you didn't
read the entire thread, especially the first message[1].

> 
> 
> Walter Alejandro Iglesias wrote on Fri, Apr 11, 2025 at 12:18:22PM +0200:
> 
> > Please excuse me if I'm a bit chaotic.
> 
> On the one hand, often enough bad patches trigger good ones,
> so do not panic.  On the other hand, understand trying to fix stuff
> and sending bad patches as a valuable learning experience.

I fully agree with what you're saying[2], but reactions from other
developers in more than one occasion in these lists led me to think that
not everyone here agrees with this.

> 
> > I don't do this every day;
> 
> I don't do so any longer either; it has been a few years that i looked
> at this code.
> 
> > I have trouble understanding the code at first.
> 
> It is indeed mostly devoid of comments.  On the other hand, much of it
> is pretty straightforward.  So while a few comments here and there
> might help to clarify some of the less obvious aspects, i don't
> think we need comments all over the place.

I thought the same thing too.

> 
> > This seems to work:
> >  
> > Index: vi.c
> > ===================================================================
> > RCS file: /cvs/src/bin/ksh/vi.c,v
> > diff -u -p -r1.60 vi.c
> > --- vi.c    12 Mar 2021 02:10:25 -0000      1.60
> > +++ vi.c    11 Apr 2025 09:59:40 -0000
> > @@ -834,8 +834,12 @@ vi_cmd(int argcnt, const char *cmd)
> >  
> >             case 'p':
> 
>   /* What follows is the implementation of the paste (p) command. */
>      (Clear enough without a comment.)
> 
> >                     modified = 1; hnum = hlast;
> > -                   if (es->linelen != 0)
> > +                   if (es->linelen != 0) {
> >                             es->cursor++;
> 
>   /* This advances the cursor to the next byte.
>      Needed because the subsequent putbuf() insertion inserts left
>      of the cursor, but we want to insert to the right of it.
>      You correctly recognized that the problem is that advancing
>      by a single byte is not enough if we are sitting on the first
>      byte of a UTF-8 character. */
> 
> > +                           while (!isascii(es->cbuf[es->cursor]) && 
> > +                               es->cursor < es->linelen)
> > +                                   es->cursor++;
> 
> That, however, i totally wrong.  I conclude that from mere code inspection
> and i'm not even going to test it.  Imagine we are sitting at the
> beginning of a long string of UTF-8 characters.  This will hurry
> past the whole long string because none of the bytes in this long
> string is ASCII.
> 
> So instead of inserting after the current character and before the next
> character, you would now skip potentially many UTF-8 characters and
> insert before the next ASCII character, perhaps far to the right.

You're right on this point.  This is where methodical thinking (the axe)
comes to do its job. :-)

> 
> 
> So what is needed instead?  This is a faily standard task:
> 
>   If we sit on a UTF-8 start byte, advance past the whole UTF-8
>   character.
> 
> So what we need is
> 
>   while (es->cursor < es->linelen && isu8cont(es->cbuf[es->cursor]))
>       es->cursor++;

Curiously, this is almost what I tried at first, except for I didn't
take in care to limit the loop to linelen.  To be precise, I did this:

  while (isu8cont(es->cbuf[es->cursor]))
        es->cursor++;

This gave me segfaults, now it's obvious to me why but then I didn't
understand, so I looked for a solution with isspace() (aware that there
was something sloppy about this solution.)

> 
> But that can be expressed in a more compact way, see below.
> 
> A few lines down, the es->cursor-- that you do not change is also
> wrong.  After the insertion, the curser sits after the insertion,
> but we want it on the last letter of the insertion.
> 
> > +                   }
> >                     while (putbuf(ybuf, yanklen, 0) == 0 && --argcnt > 0)
> >                             ;
> >                     if (es->cursor != 0)
> > @@ -1195,8 +1199,7 @@ domove(int argcnt, const char *cmd, int 
> >                     return -1;
> >             ncursor = (*cmd == 'e' ? endword : Endword)(argcnt);
> >             if (!sub)
> > -                   while (isu8cont((unsigned char)es->cbuf[--ncursor]))
> > -                           continue;
> > +                   --ncursor;
> >             break;
> 
> This change breaks the "move to end of word" (e) command.

Could you give me an example?  (Read my following paragraph, please.)

> It seems unrelated because your example only uses "ye", that is,
> "e" as a subcommand, but your code change only changes the case
> where "e" is stand-alone, i.e. *not* a subcommand.  Consequently,
> i do not understand what you are trying to achieve with this change
> in domove().

[1] This is why I suspect you didn't read my other messages.  Thanks to
an explanation from Lucas I found the second issue (the 'p' command
one.)  As I explain in my messages, my last patch was intended to fix
both issues.

About the first issue, the 'e' command one.  In the following part of
domove() I still don't understand what the while() conditional is for:

        case 'e':
        case 'E':
                if (!sub && es->cursor + 1 >= es->linelen)
                        return -1;
                ncursor = (*cmd == 'e' ? endword : Endword)(argcnt);
                if (!sub)
                        while (isu8cont((unsigned char)es->cbuf[--ncursor]))
                                continue;
                break;

In any case, if you preffer not to remove that conditional the following
diff also fix the first issue:

Index: vi.c
===================================================================
RCS file: /cvs/src/bin/ksh/vi.c,v
diff -u -p -r1.60 vi.c
--- vi.c        12 Mar 2021 02:10:25 -0000      1.60
+++ vi.c        12 Apr 2025 08:06:14 -0000
@@ -1194,9 +1194,11 @@ domove(int argcnt, const char *cmd, int 
                if (!sub && es->cursor + 1 >= es->linelen)
                        return -1;
                ncursor = (*cmd == 'e' ? endword : Endword)(argcnt);
-               if (!sub)
-                       while (isu8cont((unsigned char)es->cbuf[--ncursor]))
+               if (!sub) {
+                       while (isu8cont((unsigned char)es->cbuf[ncursor]))
                                continue;
+                       --ncursor;
+               }
                break;
 
        case 'f':


For others testing convenience I paste a diff at the bottom including
this and your diff together.


> 
> The following patch is not yet complete and only survived minimal,
> but not yet sufficient testing.  For example, the 'P' command
> right below obviously suffers from the same problem.

As far as I tested, I didn't notice the 'P' failing with the same error.
That's why I didn't touch anything there.

> 
> Even if i would get OKs for the following patch right now,

As far as I tested it, it seems that your patch effectively solves the
second issue.


> i would
> prefer waiting until after release.  We have clearly lived with this
> bug fur years, so it is not release-critical, and the risk of breaking
> the shell so close to release does not feel acceptable.
> 
> As a final remark, this patch illustrates what i said a few days ago.
> The places in the code affected by these kinds of issues are not
> easy to find.  There may be no more conspicious indications in the
> code than an innocuous-looking ++ or -- operator.
> 
> Yours,
>   Ingo


Thank you, Ingo!  I'm glad to have your feedback again.  I thought you'd
gotten tired of me. :-)


[2] I don't know if you remember, but a discussion I had with you
    privately *many* years ago where you expressed this approach was
    what encourage me to post my mail(1) patches in tech@ years later.
    That's why your first reaction back then seemed contradictory to me.


Final diff solving both issues:

Index: vi.c
===================================================================
RCS file: /cvs/src/bin/ksh/vi.c,v
diff -u -p -r1.60 vi.c
--- vi.c        12 Mar 2021 02:10:25 -0000      1.60
+++ vi.c        12 Apr 2025 08:10:28 -0000
@@ -834,12 +834,16 @@ vi_cmd(int argcnt, const char *cmd)
 
                case 'p':
                        modified = 1; hnum = hlast;
-                       if (es->linelen != 0)
-                               es->cursor++;
+                       /* Insert after the current character. */
+                       while (es->cursor < es->linelen)
+                               if (!isu8cont(es->cbuf[++es->cursor]))
+                                       break;
                        while (putbuf(ybuf, yanklen, 0) == 0 && --argcnt > 0)
                                ;
-                       if (es->cursor != 0)
-                               es->cursor--;
+                       /* Back up to the last character inserted. */
+                       while (es->cursor != 0)
+                               if (!isu8cont(es->cbuf[--es->cursor]))
+                                       break;
                        if (argcnt != 0)
                                return -1;
                        break;
@@ -1194,9 +1198,11 @@ domove(int argcnt, const char *cmd, int 
                if (!sub && es->cursor + 1 >= es->linelen)
                        return -1;
                ncursor = (*cmd == 'e' ? endword : Endword)(argcnt);
-               if (!sub)
-                       while (isu8cont((unsigned char)es->cbuf[--ncursor]))
+               if (!sub) {
+                       while (isu8cont((unsigned char)es->cbuf[ncursor]))
                                continue;
+                       --ncursor;
+               }
                break;
 
        case 'f':


-- 
Walter

Reply via email to