On Fri, Apr 11, 2025 at 08:28:38PM +0200, Ingo Schwarze wrote: > Hi Walter, > > On Sat, Apr 05, 2025 at 01:33:10PM +0200, Walter Alejandro Iglesias wrote: > > > I discovered another issue (in the unpatched ksh). > > Since i hate posting UTF-8 to mailing lists, let me paraphrase your > finding such that i do not need to quote your (excessively verbose) > example. > > Put ksh(1) into VI command line editing mode. > Type a line containing any UTF-8 character (just a two-byte line > consisting of a single character in enough). > Put anything into the yank buffer (even a single ASCII character is enough). > Put the cursor on any UTF-8 character (technically, that means the > code implementating the shell has the cursor sitting on the first > byte of that UTF-8 character, but that is an implementation detail > not visible to the user). > Issue the paste (p) command. > > The result is that the content of the yank buffer is inserted into the > middle of the UTF-8 character. Fur example, if you use the UTF-8 > character 0xc3 0xa9 (e-accent aigu) and the paste buffer 0x61 (letter a), > then you get 0xc3 0x61 0xa9, which is not a valid UTF-8 string. > > Consequently, i do confirm that you found a bug.
Your long and detailed explanations reflect your methodical thinking, which is undoubtedly less prone to error, but this has its downside: methodical and logical thinking isn't always effective in uncovering errors and finding solutions to problems, at least not in a first stage. In every creative endeavor, there's a chaotic first phase, a famous Argentine writer, Jorge Luis Borges, quoted advice from his father, who was also a writer: "First, unleash your imagination. Then, the axe." Also, we have to take into account how lazy people are to read these days, for example, from what you tell me below, I suspect you didn't read the entire thread, especially the first message[1]. > > > Walter Alejandro Iglesias wrote on Fri, Apr 11, 2025 at 12:18:22PM +0200: > > > Please excuse me if I'm a bit chaotic. > > On the one hand, often enough bad patches trigger good ones, > so do not panic. On the other hand, understand trying to fix stuff > and sending bad patches as a valuable learning experience. I fully agree with what you're saying[2], but reactions from other developers in more than one occasion in these lists led me to think that not everyone here agrees with this. > > > I don't do this every day; > > I don't do so any longer either; it has been a few years that i looked > at this code. > > > I have trouble understanding the code at first. > > It is indeed mostly devoid of comments. On the other hand, much of it > is pretty straightforward. So while a few comments here and there > might help to clarify some of the less obvious aspects, i don't > think we need comments all over the place. I thought the same thing too. > > > This seems to work: > > > > Index: vi.c > > =================================================================== > > RCS file: /cvs/src/bin/ksh/vi.c,v > > diff -u -p -r1.60 vi.c > > --- vi.c 12 Mar 2021 02:10:25 -0000 1.60 > > +++ vi.c 11 Apr 2025 09:59:40 -0000 > > @@ -834,8 +834,12 @@ vi_cmd(int argcnt, const char *cmd) > > > > case 'p': > > /* What follows is the implementation of the paste (p) command. */ > (Clear enough without a comment.) > > > modified = 1; hnum = hlast; > > - if (es->linelen != 0) > > + if (es->linelen != 0) { > > es->cursor++; > > /* This advances the cursor to the next byte. > Needed because the subsequent putbuf() insertion inserts left > of the cursor, but we want to insert to the right of it. > You correctly recognized that the problem is that advancing > by a single byte is not enough if we are sitting on the first > byte of a UTF-8 character. */ > > > + while (!isascii(es->cbuf[es->cursor]) && > > + es->cursor < es->linelen) > > + es->cursor++; > > That, however, i totally wrong. I conclude that from mere code inspection > and i'm not even going to test it. Imagine we are sitting at the > beginning of a long string of UTF-8 characters. This will hurry > past the whole long string because none of the bytes in this long > string is ASCII. > > So instead of inserting after the current character and before the next > character, you would now skip potentially many UTF-8 characters and > insert before the next ASCII character, perhaps far to the right. You're right on this point. This is where methodical thinking (the axe) comes to do its job. :-) > > > So what is needed instead? This is a faily standard task: > > If we sit on a UTF-8 start byte, advance past the whole UTF-8 > character. > > So what we need is > > while (es->cursor < es->linelen && isu8cont(es->cbuf[es->cursor])) > es->cursor++; Curiously, this is almost what I tried at first, except for I didn't take in care to limit the loop to linelen. To be precise, I did this: while (isu8cont(es->cbuf[es->cursor])) es->cursor++; This gave me segfaults, now it's obvious to me why but then I didn't understand, so I looked for a solution with isspace() (aware that there was something sloppy about this solution.) > > But that can be expressed in a more compact way, see below. > > A few lines down, the es->cursor-- that you do not change is also > wrong. After the insertion, the curser sits after the insertion, > but we want it on the last letter of the insertion. > > > + } > > while (putbuf(ybuf, yanklen, 0) == 0 && --argcnt > 0) > > ; > > if (es->cursor != 0) > > @@ -1195,8 +1199,7 @@ domove(int argcnt, const char *cmd, int > > return -1; > > ncursor = (*cmd == 'e' ? endword : Endword)(argcnt); > > if (!sub) > > - while (isu8cont((unsigned char)es->cbuf[--ncursor])) > > - continue; > > + --ncursor; > > break; > > This change breaks the "move to end of word" (e) command. Could you give me an example? (Read my following paragraph, please.) > It seems unrelated because your example only uses "ye", that is, > "e" as a subcommand, but your code change only changes the case > where "e" is stand-alone, i.e. *not* a subcommand. Consequently, > i do not understand what you are trying to achieve with this change > in domove(). [1] This is why I suspect you didn't read my other messages. Thanks to an explanation from Lucas I found the second issue (the 'p' command one.) As I explain in my messages, my last patch was intended to fix both issues. About the first issue, the 'e' command one. In the following part of domove() I still don't understand what the while() conditional is for: case 'e': case 'E': if (!sub && es->cursor + 1 >= es->linelen) return -1; ncursor = (*cmd == 'e' ? endword : Endword)(argcnt); if (!sub) while (isu8cont((unsigned char)es->cbuf[--ncursor])) continue; break; In any case, if you preffer not to remove that conditional the following diff also fix the first issue: Index: vi.c =================================================================== RCS file: /cvs/src/bin/ksh/vi.c,v diff -u -p -r1.60 vi.c --- vi.c 12 Mar 2021 02:10:25 -0000 1.60 +++ vi.c 12 Apr 2025 08:06:14 -0000 @@ -1194,9 +1194,11 @@ domove(int argcnt, const char *cmd, int if (!sub && es->cursor + 1 >= es->linelen) return -1; ncursor = (*cmd == 'e' ? endword : Endword)(argcnt); - if (!sub) - while (isu8cont((unsigned char)es->cbuf[--ncursor])) + if (!sub) { + while (isu8cont((unsigned char)es->cbuf[ncursor])) continue; + --ncursor; + } break; case 'f': For others testing convenience I paste a diff at the bottom including this and your diff together. > > The following patch is not yet complete and only survived minimal, > but not yet sufficient testing. For example, the 'P' command > right below obviously suffers from the same problem. As far as I tested, I didn't notice the 'P' failing with the same error. That's why I didn't touch anything there. > > Even if i would get OKs for the following patch right now, As far as I tested it, it seems that your patch effectively solves the second issue. > i would > prefer waiting until after release. We have clearly lived with this > bug fur years, so it is not release-critical, and the risk of breaking > the shell so close to release does not feel acceptable. > > As a final remark, this patch illustrates what i said a few days ago. > The places in the code affected by these kinds of issues are not > easy to find. There may be no more conspicious indications in the > code than an innocuous-looking ++ or -- operator. > > Yours, > Ingo Thank you, Ingo! I'm glad to have your feedback again. I thought you'd gotten tired of me. :-) [2] I don't know if you remember, but a discussion I had with you privately *many* years ago where you expressed this approach was what encourage me to post my mail(1) patches in tech@ years later. That's why your first reaction back then seemed contradictory to me. Final diff solving both issues: Index: vi.c =================================================================== RCS file: /cvs/src/bin/ksh/vi.c,v diff -u -p -r1.60 vi.c --- vi.c 12 Mar 2021 02:10:25 -0000 1.60 +++ vi.c 12 Apr 2025 08:10:28 -0000 @@ -834,12 +834,16 @@ vi_cmd(int argcnt, const char *cmd) case 'p': modified = 1; hnum = hlast; - if (es->linelen != 0) - es->cursor++; + /* Insert after the current character. */ + while (es->cursor < es->linelen) + if (!isu8cont(es->cbuf[++es->cursor])) + break; while (putbuf(ybuf, yanklen, 0) == 0 && --argcnt > 0) ; - if (es->cursor != 0) - es->cursor--; + /* Back up to the last character inserted. */ + while (es->cursor != 0) + if (!isu8cont(es->cbuf[--es->cursor])) + break; if (argcnt != 0) return -1; break; @@ -1194,9 +1198,11 @@ domove(int argcnt, const char *cmd, int if (!sub && es->cursor + 1 >= es->linelen) return -1; ncursor = (*cmd == 'e' ? endword : Endword)(argcnt); - if (!sub) - while (isu8cont((unsigned char)es->cbuf[--ncursor])) + if (!sub) { + while (isu8cont((unsigned char)es->cbuf[ncursor])) continue; + --ncursor; + } break; case 'f': -- Walter