Re: sed(1): missing NUL in pattern space

2017-07-05 Thread Otto Moerbeek
On Sat, Jul 01, 2017 at 01:20:19PM +, kshe wrote:

> On Tue, 27 Jun 2017 09:29:10 +, Otto Moerbeek wrote:
>  >
>  > at last a followup, for the original problem.
>  >
>  > This diff incorporates your later comment. It does not cause the newly
>  > added regress test to fail, though.
>  >
>  > So that poses the question if this is what you meant.
>  >
>  > -Otto
>  >
>  > Index: process.c
>  > ===
>  > RCS file: /cvs/src/usr.bin/sed/process.c,v
>  > retrieving revision 1.32
>  > diff -u -p -r1.32 process.c
>  > --- process.c22 Feb 2017 14:09:09 -1.32
>  > +++ process.c27 Jun 2017 09:16:33 -
>  > @@ -120,8 +120,10 @@ redirect:
>  >  cp = cp->u.c;
>  >  goto redirect;
>  >  case 'c':
>  > +if (pd)
>  > +break;
>  >  pd = 1;
>  > -psl = 0;
>  > +ps[psl = 0] = '\0';
>  >  if (cp->a2 == NULL || lastaddr || lastline())
>  >  (void)fprintf(outfile, "%s", cp->t);
>  >  break;
>  > @@ -138,6 +140,7 @@ redirect:
>  >  } else {
>  >  psl -= (p + 1) - ps;
>  >  memmove(ps, p + 1, psl);
>  > +ps[psl] = '\0';
>  >  goto top;
>  >  }
>  >  case 'g':
>  >
> 
> Indeed, there was a wording error in my second paragraph: I meant to
> align the behaviour of `l' (not `c') with that of `p', such that using
> `l' after `c' would not print anything (not even `$'), just like `p' in
> this case, which does not print anything either (not even a newline).

somebody take over this, please. I do not seem to be able to find the
time to validate this diff.

-Otto

> 
> Index: process.c
> ===
> RCS file: /cvs/src/usr.bin/sed/process.c,v
> retrieving revision 1.32
> diff -up -r1.32 process.c
> --- process.c 22 Feb 2017 14:09:09 -  1.32
> +++ process.c 1 Jul 2017 10:24:21 -
> @@ -121,7 +121,7 @@ redirect:
>   goto redirect;
>   case 'c':
>   pd = 1;
> - psl = 0;
> + ps[psl = 0] = '\0';
>   if (cp->a2 == NULL || lastaddr || lastline())
>   (void)fprintf(outfile, "%s", cp->t);
>   break;
> @@ -138,6 +138,7 @@ redirect:
>   } else {
>   psl -= (p + 1) - ps;
>   memmove(ps, p + 1, psl);
> + ps[psl] = '\0';
>   goto top;
>   }
>   case 'g':
> @@ -158,6 +159,8 @@ redirect:
>   (void)fprintf(outfile, "%s", cp->t);
>   break;
>   case 'l':
> + if (pd)
> + break;
>   lputs(ps);
>   break;
>   case 'n':
> 
> That being said, there is more to this issue than what such a simple
> patch could fix.  For example, in a way your diff also makes sense: if
> it is believed that one ought not to use `l' nor `p' to print an
> inexistant pattern space, then surely deleting it again with `c' should
> be disallowed too.  However, as partly demonstrated by my last message,
> there is absolutely no coherence anywhere about this, so in the end one
> might wonder which command is correctly behaved and which isn't.
> 
> For instance, it is a matter of course that the sequence `N;s/.*\n//;p'
> should be just the same as `n' when normal output has not been disabled;
> but, in OpenBSD's sed, because that `pd' flag is implemented in such an
> erratic fashion, this isn't always the case:
> 
>   $ jot 6 | sed 'c\
>   > text
>   > :loop
>   > N;s/.*\n//;p
>   > bloop'
>   text
>   2
>   4
>   6
> 
>   $ jot 6 | sed 'c\
>   > text
>   > :loop
>   > n
>   > bloop'
>   text
>   2
>   3
>   4
>   5
>   6
> 
> For the same reason, nothing is printed by the following script, even if
> all lines are read:
> 
>   $ jot 6 | sed 'c\
>   > text
>   > :loop
>   > $q;N
>   > bloop'
>   text
> 
> (As a side note, amusingly, replacing the `q' above with `s/$//p' would
> actually produce the expected output, when in contrast using something
> like `s/^//p' would not...  But this is an unrelated bug.)
> 
> Likewise, one would believe the command `y/a/b/' to be functionally
> equivalent to `s/a/b/g'; but, again, many examples of the opposite
> behaviour can be 

Re: sed(1): missing NUL in pattern space

2017-07-01 Thread kshe
On Tue, 27 Jun 2017 09:29:10 +, Otto Moerbeek wrote:
 >
 > at last a followup, for the original problem.
 >
 > This diff incorporates your later comment. It does not cause the newly
 > added regress test to fail, though.
 >
 > So that poses the question if this is what you meant.
 >
 > -Otto
 >
 > Index: process.c
 > ===
 > RCS file: /cvs/src/usr.bin/sed/process.c,v
 > retrieving revision 1.32
 > diff -u -p -r1.32 process.c
 > --- process.c22 Feb 2017 14:09:09 -1.32
 > +++ process.c27 Jun 2017 09:16:33 -
 > @@ -120,8 +120,10 @@ redirect:
 >  cp = cp->u.c;
 >  goto redirect;
 >  case 'c':
 > +if (pd)
 > +break;
 >  pd = 1;
 > -psl = 0;
 > +ps[psl = 0] = '\0';
 >  if (cp->a2 == NULL || lastaddr || lastline())
 >  (void)fprintf(outfile, "%s", cp->t);
 >  break;
 > @@ -138,6 +140,7 @@ redirect:
 >  } else {
 >  psl -= (p + 1) - ps;
 >  memmove(ps, p + 1, psl);
 > +ps[psl] = '\0';
 >  goto top;
 >  }
 >  case 'g':
 >

Indeed, there was a wording error in my second paragraph: I meant to
align the behaviour of `l' (not `c') with that of `p', such that using
`l' after `c' would not print anything (not even `$'), just like `p' in
this case, which does not print anything either (not even a newline).

Index: process.c
===
RCS file: /cvs/src/usr.bin/sed/process.c,v
retrieving revision 1.32
diff -up -r1.32 process.c
--- process.c   22 Feb 2017 14:09:09 -  1.32
+++ process.c   1 Jul 2017 10:24:21 -
@@ -121,7 +121,7 @@ redirect:
goto redirect;
case 'c':
pd = 1;
-   psl = 0;
+   ps[psl = 0] = '\0';
if (cp->a2 == NULL || lastaddr || lastline())
(void)fprintf(outfile, "%s", cp->t);
break;
@@ -138,6 +138,7 @@ redirect:
} else {
psl -= (p + 1) - ps;
memmove(ps, p + 1, psl);
+   ps[psl] = '\0';
goto top;
}
case 'g':
@@ -158,6 +159,8 @@ redirect:
(void)fprintf(outfile, "%s", cp->t);
break;
case 'l':
+   if (pd)
+   break;
lputs(ps);
break;
case 'n':

That being said, there is more to this issue than what such a simple
patch could fix.  For example, in a way your diff also makes sense: if
it is believed that one ought not to use `l' nor `p' to print an
inexistant pattern space, then surely deleting it again with `c' should
be disallowed too.  However, as partly demonstrated by my last message,
there is absolutely no coherence anywhere about this, so in the end one
might wonder which command is correctly behaved and which isn't.

For instance, it is a matter of course that the sequence `N;s/.*\n//;p'
should be just the same as `n' when normal output has not been disabled;
but, in OpenBSD's sed, because that `pd' flag is implemented in such an
erratic fashion, this isn't always the case:

$ jot 6 | sed 'c\
> text
> :loop
> N;s/.*\n//;p
> bloop'
text
2
4
6

$ jot 6 | sed 'c\
> text
> :loop
> n
> bloop'
text
2
3
4
5
6

For the same reason, nothing is printed by the following script, even if
all lines are read:

$ jot 6 | sed 'c\
> text
> :loop
> $q;N
> bloop'
text

(As a side note, amusingly, replacing the `q' above with `s/$//p' would
actually produce the expected output, when in contrast using something
like `s/^//p' would not...  But this is an unrelated bug.)

Likewise, one would believe the command `y/a/b/' to be functionally
equivalent to `s/a/b/g'; but, again, many examples of the opposite
behaviour can be constructed; for instance:

$ echo | sed 'c\
> text
> s/^/a/;s/^//
> y/a/b/
> s/^//p;d'
text
a

$ echo | sed 'c\
> text
> s/^/a/;s/^//
> s/a/b/g
> s/^//p;d'
text
b

Because, of course, thanks to that `pd' magic, `y' is only 

Re: sed(1): missing NUL in pattern space

2017-06-27 Thread Otto Moerbeek
On Thu, Jun 15, 2017 at 09:01:15AM +0200, Otto Moerbeek wrote:

at last a followup, for the original problem.

This diff incorporates your later comment. It does not cause the newly
added regress test to fail, though.

So that poses the question if this is what you meant.

-Otto

Index: process.c
===
RCS file: /cvs/src/usr.bin/sed/process.c,v
retrieving revision 1.32
diff -u -p -r1.32 process.c
--- process.c   22 Feb 2017 14:09:09 -  1.32
+++ process.c   27 Jun 2017 09:16:33 -
@@ -120,8 +120,10 @@ redirect:
cp = cp->u.c;
goto redirect;
case 'c':
+   if (pd)
+   break;
pd = 1;
-   psl = 0;
+   ps[psl = 0] = '\0';
if (cp->a2 == NULL || lastaddr || lastline())
(void)fprintf(outfile, "%s", cp->t);
break;
@@ -138,6 +140,7 @@ redirect:
} else {
psl -= (p + 1) - ps;
memmove(ps, p + 1, psl);
+   ps[psl] = '\0';
goto top;
}
case 'g':



Re: sed(1): missing NUL in pattern space

2017-06-15 Thread Otto Moerbeek
On Tue, Jun 13, 2017 at 10:08:11AM +, kshe wrote:

> On Sat, 10 Jun 2017 10:25:27 +, Otto Moerbeek wrote:
> > Thanks for the analysis and diff, I hope to get a chanche to think
> > about this soon. At least I'll make sure this diff is not forgotten,
> > -Otto
> 
> I have seen the tests that you recently added to the tree; in the mean
> time, however, while I continued experimenting and reading the source, I
> discovered more bugs (as expected), as well as other trivial issues
> which might be worthy of note before committing any real fixes.
> 
> First of all, my comprehension of the `c' command has slightly changed
> since my last message: I now think its behaviour should be aligned with
> that of `p', `P', `w' and others, all of which produce no output after
> `c', treating a deleted pattern space as truly inexistent, and not
> merely as an empty one.  It would hence appear more coherent to add
> 
>   if (pd)
>   break;
> 
> before calling lputs(), as is done in many other places.  This would
> then change the expected output of one of the newly added tests.
> 
> In general though, that `c' command is a big mess.  It is supposed to
> delete the pattern space: in other implementations of sed that I have
> tested, that means it will essentially behave like `i' followed by `d'.
> In BSD's sed, however, it continues processing commands...  Here's one
> of the many bugs that arise because of that:
> 
>   $ echo abc | sed 'c\
>   > text
>   > s/.*/x/
>   > p
>   > s/.*/y/
>   > p'
>   text
>   x
> 
> So why isn't `y' printed too?  In comparison, using flags instead of
> standalone commands:
> 
>   $ echo abc | sed 'c\
>   > text
>   > s/.*/x/p
>   > s/.*/y/p'
>   text
>   x
>   y
> 
> Here, we do get a `y', but only one (there should be two given that `-n'
> has not been supplied to suppress normal output).  Since at least `p' as
> a flag given to `s' appears to work, let us try the same with `w':
> 
>   $ echo abc | sed 'c\
>   > text
>   > s/.*/x/w /tmp/sed.out
>   > s/.*/y/w /tmp/sed.out'
>   text
>   $ cat /tmp/sed.out
>   x
> 
> But this time it doesn't work.  Now add a third `s' command...
> 
>   $ echo abc | sed 'c\
>   > text
>   > s/.*/x/w /tmp/sed.out
>   > s/.*/y/w /tmp/sed.out
>   > s/.*/z/w /tmp/sed.out'
>   text
>   z
>   $ cat /tmp/sed.out
>   x
>   z
> 
> And it works.  Well, not really: we do see a `z', but still no `y'.
> Indeed, this does not make any sense.
> 
> The reason for this strange behaviour is nevertheless easy to figure out
> when looking at how `s' is implemented, keeping in mind that the
> substitute space is globally defined.  In any case, even if one was to
> fix that particular bug, I would probably be able to find more.  This is
> why I believe that the only sane solution is that, just like `d', `c'
> should simply
> 
>   goto new;
> 
> which would, at the same time, obviate the need for the `deleted' field
> of the SPACE structure, subsequently allowing for substantial cleanups.
> 
> Also, aside of that `c' mess, there is still the fact that NUL bytes
> originally present in the input are not well handled at all.  Of course,
> `l' cannot print past them (whereas `p' can), which is a shame; but
> significantly worse is the inconsistent behaviour of `s' in that case.
> At first, it seems to work well (in the output, `^@' represents a NUL):
> 
>   $ printf 'x\0y' | sed 's/x/_/'
>   _^@y
> 
> It can even handle some empty matches perfectly:
> 
>   $ printf 'x\0y' | sed 's/[[:<:]]/_/' # pattern space survives
>   _x^@y
> 
> But not all of them:
> 
>   $ printf 'x\0y' | sed 's/[[:>:]]/_/' # pattern space is cut off
>   x_
>   $ printf 'x\0y' | sed 's/[[:>:]]/_/2' # likewise
>   x
> 
> One might argue that NUL bytes are not supposed to happen in textual
> input, so that these examples are irrelevant; the real problem however
> is that it also affects newlines, thus perfectly valid text:
> 
>   $ printf 'x\ny' | sed 'N;s/[[:<:]]/_/'
>   _x
>   y
>   $ printf 'x\ny' | sed 'N;s/[[:>:]]/_/'
>   x_
>   $ printf 'x\ny' | sed 'N;s/[[:>:]]/_/2'
>   x
> 
> So maybe substitute() should test for (slen == 0) instead of the
> questionable (*s == '\0' || *s == '\n').  That would seem more correct
> to me, but this is only a guess since I must admit that I don't fully
> understand that part of the code yet.  Here is a diff, for instance,
> that fixes all the above issues.
> 
> --- a/src/usr.bin/sed/process.c   Wed Jun  7 09:56:20 2017
> +++ b/src/usr.bin/sed/process.c   Tue Jun 13 07:21:59 2017
> @@ -387,14 +388,11 @@ substitute(struct s_command *cp)
>* and at the end of the line, terminate.
>*/
>   if (match[0].rm_so == match[0].rm_eo) {
> - if (*s == '\0' || *s == '\n')
> - slen 

Re: sed(1): missing NUL in pattern space

2017-06-13 Thread kshe
On Sat, 10 Jun 2017 10:25:27 +, Otto Moerbeek wrote:
> Thanks for the analysis and diff, I hope to get a chanche to think
> about this soon. At least I'll make sure this diff is not forgotten,
> -Otto

I have seen the tests that you recently added to the tree; in the mean
time, however, while I continued experimenting and reading the source, I
discovered more bugs (as expected), as well as other trivial issues
which might be worthy of note before committing any real fixes.

First of all, my comprehension of the `c' command has slightly changed
since my last message: I now think its behaviour should be aligned with
that of `p', `P', `w' and others, all of which produce no output after
`c', treating a deleted pattern space as truly inexistent, and not
merely as an empty one.  It would hence appear more coherent to add

if (pd)
break;

before calling lputs(), as is done in many other places.  This would
then change the expected output of one of the newly added tests.

In general though, that `c' command is a big mess.  It is supposed to
delete the pattern space: in other implementations of sed that I have
tested, that means it will essentially behave like `i' followed by `d'.
In BSD's sed, however, it continues processing commands...  Here's one
of the many bugs that arise because of that:

$ echo abc | sed 'c\
> text
> s/.*/x/
> p
> s/.*/y/
> p'
text
x

So why isn't `y' printed too?  In comparison, using flags instead of
standalone commands:

$ echo abc | sed 'c\
> text
> s/.*/x/p
> s/.*/y/p'
text
x
y

Here, we do get a `y', but only one (there should be two given that `-n'
has not been supplied to suppress normal output).  Since at least `p' as
a flag given to `s' appears to work, let us try the same with `w':

$ echo abc | sed 'c\
> text
> s/.*/x/w /tmp/sed.out
> s/.*/y/w /tmp/sed.out'
text
$ cat /tmp/sed.out
x

But this time it doesn't work.  Now add a third `s' command...

$ echo abc | sed 'c\
> text
> s/.*/x/w /tmp/sed.out
> s/.*/y/w /tmp/sed.out
> s/.*/z/w /tmp/sed.out'
text
z
$ cat /tmp/sed.out
x
z

And it works.  Well, not really: we do see a `z', but still no `y'.
Indeed, this does not make any sense.

The reason for this strange behaviour is nevertheless easy to figure out
when looking at how `s' is implemented, keeping in mind that the
substitute space is globally defined.  In any case, even if one was to
fix that particular bug, I would probably be able to find more.  This is
why I believe that the only sane solution is that, just like `d', `c'
should simply

goto new;

which would, at the same time, obviate the need for the `deleted' field
of the SPACE structure, subsequently allowing for substantial cleanups.

Also, aside of that `c' mess, there is still the fact that NUL bytes
originally present in the input are not well handled at all.  Of course,
`l' cannot print past them (whereas `p' can), which is a shame; but
significantly worse is the inconsistent behaviour of `s' in that case.
At first, it seems to work well (in the output, `^@' represents a NUL):

$ printf 'x\0y' | sed 's/x/_/'
_^@y

It can even handle some empty matches perfectly:

$ printf 'x\0y' | sed 's/[[:<:]]/_/' # pattern space survives
_x^@y

But not all of them:

$ printf 'x\0y' | sed 's/[[:>:]]/_/' # pattern space is cut off
x_
$ printf 'x\0y' | sed 's/[[:>:]]/_/2' # likewise
x

One might argue that NUL bytes are not supposed to happen in textual
input, so that these examples are irrelevant; the real problem however
is that it also affects newlines, thus perfectly valid text:

$ printf 'x\ny' | sed 'N;s/[[:<:]]/_/'
_x
y
$ printf 'x\ny' | sed 'N;s/[[:>:]]/_/'
x_
$ printf 'x\ny' | sed 'N;s/[[:>:]]/_/2'
x

So maybe substitute() should test for (slen == 0) instead of the
questionable (*s == '\0' || *s == '\n').  That would seem more correct
to me, but this is only a guess since I must admit that I don't fully
understand that part of the code yet.  Here is a diff, for instance,
that fixes all the above issues.

--- a/src/usr.bin/sed/process.c Wed Jun  7 09:56:20 2017
+++ b/src/usr.bin/sed/process.c Tue Jun 13 07:21:59 2017
@@ -387,14 +388,11 @@ substitute(struct s_command *cp)
 * and at the end of the line, terminate.
 */
if (match[0].rm_so == match[0].rm_eo) {
-   if (*s == '\0' || *s == '\n')
-   slen = -1;
-   else
-   slen--;
-   if (*s != '\0') {
-   cspace(, s++, 1, APPEND);
-   le++;
-  

Re: sed(1): missing NUL in pattern space

2017-06-10 Thread Otto Moerbeek
On Fri, Jun 09, 2017 at 07:58:48AM +, kshe wrote:

> Hi,
> 
> There is an ongoing confusion in sed's source about the concept of EOL:
> the code contradicts itself as to whether it should be determined by a
> trailing NUL or by looking up the `len' field of the SPACE structure.
> 
> At least two commands (`l' and `s') assume that the pattern space is
> always NUL-terminated.  However, at least two other commands (`c' and
> `D') do not comply with that assumption, modifying it by merely updating
> its length attribute.  As a result, using `l' or `s' after `c' or `D'
> produces incorrect output.
> 
> For instance, in the following excerpt, `l' should print `bb$':
> 
>   $ printf 'a\nbb\n' | sed '${l;q;};N;D'
>   $
>   bb
> 
> For the same underlying reason, it also disregards the deletion of the
> pattern space that `c' is supposed to entail:
> 
>   $ echo abc | sed 'c\
>   > text
>   > l'
>   text
>   abc$
> 
> The substitute command can likewise get confused and add an extra
> character after a replacement:
> 
>   $ printf 'a\nbb\n' | sed '${s/$/x/;q;};N;D'
>   bbxb
> 
> Note how this does not happen when the substitution pattern matches more
> than the empty string:
> 
>   $ printf 'a\nbb\n' | sed '${s/.$//;q;};N;D'
>   bbx
> 
> After an empty match, the character that `s' adds to the pattern space
> depends on how memmove(3) modified it beforehand.  Here's a very
> simplified version of the original sequence of commands from which I
> discovered these bugs, where `i' appear instead of `b':
> 
>   $ printf 'a\nbb\n' | sed '${s/$/x/;q;};N;s/a/ZZi/;D'
>   bbxi
> 
> Similarily, when one believes the pattern space to be empty after a `c'
> command, something like `s/.*//' will magically revive one character
> from that emptiness:
> 
>   $ echo abc | sed 'c\
>   > text
>   > s/.*//'
>   text
>   a
> 
> As I came across these in an actual script that I was writing, I was
> quite amazed to find out that they went unnoticed for so long, without
> anyone ever trying to see what `l' does after `c' or `D'.  Indeed, the
> `l' command is broken since revision 1.1 of process.c, making that bug
> older than I am.  It has also been present in the other BSDs, but:
> 
>   o FreeBSD fixed it by accident in 2004 by adding support for
> multibyte encodings;
> 
>   o NetBSD fixed it by accident in 2014 by importing FreeBSD's sed
> (merging their changes afterwards, but that didn't reintroduce
> the bug, since they hadn't touched that part of the code).
> 
> The more serious bug regarding empty matches in the substitute command
> was introduced in OpenBSD in 2011, when schwarze@ decided to rewrite its
> main loop, intending to "fix multiple issues regarding the replacement
> of zero-length strings", but apparently with the unfortunate side effect
> of introducing a new issue regarding the replacement of zero-length
> strings.  Finally, this commit was adopted by FreeBSD in 2016 when they
> synchronised some of their code with OpenBSD's to ease importing other
> fixes from it.
> 
> Thus, OpenBSD still has the old `l' bug as well as the more recent `s'
> one, FreeBSD only has the latter, and since 2014 NetBSD is not longer
> affected by any of these, although the fact that `c' and `D' omit to
> NUL-terminate the pattern space could be considered a bug in itself,
> since at least one piece of code present for 20 years in their tree was
> making the contrary assumption.  So who knows where else it was made?
> Actually, with all due respect, probably not even the original
> developers: many of their choices suggest that the end of the pattern
> space was not always meant to be encoded as a trailing NUL, but they
> nevertheless designed the function `lputs' to only take a simple string
> parameter, by which no separate length information could ever be
> accessible.
> 
> I therefore have the strong feeling that this is not the last bug of
> BSD's sed.  The remaining ones may be hidden in dark corners, but I do
> expect them to show up at some point in the future, considering how
> fragile the existing code looks.  The overall situation also prevents
> any significant improvement to be successfully conducted, as changing
> more than two lines will most likely lead to subtle regressions; see the
> last few commits at
> 
> https://svnweb.freebsd.org/base/head/usr.bin/sed/process.c
> 
> and
> 
> http://cvsweb.netbsd.org/cgi-bin/cvsweb.cgi/src/usr.bin/sed/process.c
> 
> for recent examples of such failed attempts.
> 
> Anyway, until I digress even more, here is a patch for that one, forcing
> `c' and `D' to put a NUL where `l' and `s' expect one to be.  By the
> way, whether the fault is on one side or the other is actually unclear.
> It does seem redundant to enforce such a convention when one already has
> access to the length of the corresponding string.  As a matter of fact,
> the overwhelming majority of the code does