Re: sed and delimiters that are also special characters to REs

Christoph Anton Mitterer via austin-group-l at The Open Group Fri, 14 Jan 2022 08:02:43 -0800

On Fri, 2022-01-14 at 09:07 +0300, Oğuz wrote:
> > And where does it say that? I mean in the standard.
> > I.e. where does it say, that parsing is only allowed to happen in
> > one
> > stage from left to right, especially not only with respect to an RE
> > itself, but also when an RE is embedded in a command with
> > delimiters.
> 
> It is what makes sense.


Apart from being more efficient, since it requires only one pass, I
don't see why it should make more or less sense.

And as mentioned before, the wording of the standard IMO rather implies
the other behaviour.

POSIX always says something like "backslash followed by delimiter is
»literal«" (whatever »literal« is supposed to mean).

It does not say:
"A backslash THAT IS NOT ITSELF ESCAPED, followed by delimiter, is »literal«"



> > Where does it say, whether:
> > s.[.].X.
> > is:
> > a) s/[/]/
> >    followed by X.
> > 
> > or:
> > 
> > b) s/[.]/X/
> 
> It is clearly `a'. The second period is not preceded by a backslash.

But that again just shows how ambiguous the standard is:
- if you prefer the left-to-right all-at-once parsing, then it's
ambiguous, because the standard leaves open which rule wins: the one
for bracket expressions, or the one saying that a delimiter character
used as a non-delimiter needs to be preceded by \

- if one prefers the two stages, where one looks first for unquoted
delimiter characters (which, however, you say "wouldn't be the one that
makes sense"), then it would clearly be (b).


And since one major implementation (GNU sed) would already get it
wrong if it were (b), that's IMO proof enough that things aren't that
clear.
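
To see which reading a given sed takes for the bracket example, a small
probe can help. This is only a sketch: the function name probe_bracket is
made up here, and the script merely reports what the local sed does,
without claiming which result is the correct one.

```shell
#!/bin/sh
# Hypothetical helper: report how the local sed handles  s.[.].X.
probe_bracket() {
    # Under reading (b) the command is s/[.]/X/ and '.' becomes 'X'.
    # Under reading (a) the trailing "X." is parsed as flags, which sed
    # will most likely reject with a non-zero exit status.
    if out=$(printf '%s\n' '.' | sed 's.[.].X.' 2>/dev/null); then
        printf 'accepted, output: %s\n' "$out"
    else
        printf 'rejected\n'
    fi
}

probe_bracket
```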



> > Then, with a delimiter that is also a special character, the
> > special
> > character would no longer be usable as such.
> 
> Again, that's why any character other than backslash and newline can
> be used as the delimiter.

You mean, that one could simply use another delimiter?

But that's not the point here:
Of course one can. And I'm fine if one says: "it's simply not possible",
but this should then IMO also be clearly pointed out.
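
For completeness, the workaround being discussed, as a sketch: with a
delimiter that has no special meaning in REs or in the replacement, the
question does not arise. The input a.b and the function names are purely
illustrative.

```shell
#!/bin/sh
# '/' and ',' are not special in BREs, so '\.' and '[.]' can only mean
# "a literal period" here.
dot_escaped() { printf '%s\n' "$1" | sed 's/\./X/'; }
dot_bracket() { printf '%s\n' "$1" | sed 's,[.],X,'; }

dot_escaped 'a.b'   # prints aXb
dot_bracket 'a.b'   # prints aXb
```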



> > btw: busybox sed, behaves as you say:
> > $ printf '%s\n' '.' | busybox sed 's.\..X.'
> > X
> > $ printf '%s\n' 'a' | busybox sed 's.\..X.'
> > a
> 
> So do Solaris sed and Unixware sed. On the other hand, all seds I
> tested do this
> 
>     $ echo a | sed 's1(.)1\11'
>     a
> 
> except GNU sed; it prints 1 instead.

Uhm? What you wrote above is a completely different test, one not using
bracket expressions. But it's another nice example that the ambiguity
also affects back-references.

However:

(GNU sed) 4.8 shows the following here:
$ echo a | sed 's1(.)1\11'
a
$ echo a | busybox sed 's1(.)1\11'
a



>  The behavior varies widely with
> other special characters and sequences, so the best standard
> developers could do is to state that the results are unspecified when
> they are used as delimiters; OR, leave the text as is, which already
> implies that it is unspecified.

Which the standard however doesn't do as of now. I'd also be fine with
the standard not mandating "one correct behaviour" when "special"
characters are used as delimiter, but it should then clearly say that
for such characters behaviour is undefined (and indeed varies between
major implementations) and also name all the dangerous ones:
- at least all special characters (with respect to BRE/ERE)
- digits 1-9
- &
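
A script that builds sed commands could guard against exactly this list.
A minimal sketch follows; the function name is_risky_delim is
hypothetical, and the exact set of "special" characters is one reading of
the list above (the standard gives no such list today).

```shell
#!/bin/sh
# is_risky_delim CHAR: succeed if CHAR falls into one of the three
# groups listed above. The concrete character set is an assumption.
is_risky_delim() {
    case $1 in
        '.'|'['|']'|'*'|'^'|'$'|'('|')'|'{'|'}'|'+'|'?'|'|')
            return 0 ;;   # special in BREs and/or EREs
        [1-9])
            return 0 ;;   # could be taken as part of a back-reference
        '&')
            return 0 ;;   # special in the replacement
        *)
            return 1 ;;
    esac
}

for d in / , . 1 '&'; do
    if is_risky_delim "$d"; then
        printf '%s: risky\n' "$d"
    else
        printf '%s: ok\n' "$d"
    fi
done
```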


Cheers,
Chris.
