Re: [Issue 8 drafts 0001550]: clarifications/ambiguities in the description of context addresses and their delimiters for sed

Christoph Anton Mitterer via austin-group-l at The Open Group Sun, 17 Apr 2022 16:46:56 -0700

On Tue, 2022-04-05 at 15:54 +0100, Geoff Clare via austin-group-l at
The Open Group wrote:
> > -------------------------------------------------------------------
> > --- 
> >  (0005771) calestyo (reporter) - 2022-04-02 01:53
> >  https://austingroupbugs.net/view.php?id=1550#c5771 
> > -------------------------------------------------------------------
> > --- 
> 
> > So maybe, in "Addresses in sed" we should better *only* describe
> > the \cREc
> > form of these,... and link to "Regular Expressions in sed" for how
> > delimiters are escaped?
> 
> I think it works better the way I have it (which you said you could
> live
> with).

There are IMO two aspects here:
- I think it makes it a bit less unorganised, because from the 2nd
  sentence on, it's less about the context address, but more about
  something resulting from these (the delimiter) as part of the RE.

- A key part in (definitely) understanding how the delimiters (and
  '\n' as newline) in the RE part of context addresses and s-commands
  work semantically is, that the RE language itself is extended by
  these.

In the "Addresses in sed" section this is kinda done via "The BRE and
ERE syntax shall additionally support"

For the s-command it's however not that directly emphasised.
(See my proposal further below.)

That's why I had thought that putting it all into "Regular Expressions in
sed", and emphasising there that for sed the RE languages are extended
by the following, which is considered to be part of them, could make
things better.

Of course however, this wouldn't work for the y-command.

Anyway... your decision.

issue #1550, note #5771, point (c): I assume this *really* is on
purpose?!

> > (Ic), (Id) as well as my original (2b) would be fixed, if we'd
> > write
> > something like:
> > "When the delimiter character c is <slash>, a context address \/RE/
> > can
> > also be written as /RE/." (or something similar but better).
> > That would make it clear that \/RE/ is allowed and identical to
> > /RE/ and at
> > the same time define /RE/.
> 
> I'd suggest just changing the "example" part of my proposal.
> I.e. instead of:
> 
>     For example, the context address "\xabc\xdefx" is equivalent to
>     "/abcxdef/".
> 
> it could say:
> 
>     The construction "\cREc" does not need to be used when the
> delimiter
>     is a <slash>; for example, the context address "\xabc\xdefx" is
>     equivalent to "/abcxdef/".

I guess I could live with that... if I had to ;-)

It's still a bit non-straightforward, I think,... we have now:
- > context address (which consists of an RE, as described in Regular
  > Expressions in sed, preceded and followed by a delimiter, usually a
  > <slash>).

  This basically says that a context-address is:
  <delimiter>RE<delimiter>
  and that delimiter is usually a <slash>, which alone doesn't say
  strictly anything whether or not it needs to be escaped.

- > In a context address, any character other than <backslash> or
  > <newline> can be specified for use as the delimiter by means of
  > the construction "\cREc", where c is the chosen delimiter 
  > character.

  This basically says: Anything else than <backslash> or <newline>
  (thus including <slash>) can be used via "\cREc".

- > The construction "\cREc" does not need to be used when the
  > delimiter is a <slash>
  (from the current proposal in:
  https://austingroupbugs.net/view.php?id=1550#c5790 )

  This kinda works... but what I dislike about it is, that it says the
  "[whole] construct" doesn't need to be used...

All these together seem like a circular dependency to me.

But the actual point would be: If <slash> is used as delimiter, the
first one in cREc doesn't need to be escaped with <backslash>.

That's what I'd thought my proposed:
> When the delimiter character c is <slash>, a context address \/RE/
> can also be written as /RE/."

would nicely resolve.

Maybe a completely different way to go would be to define a context
address as:
dREd
with d being the delimiter, RE the regular expression and with d being
any character other than <backslash> and <newline>, and any character
other than <slash> needing to have the first 'd' (only!) escaped with
<backslash> as in \dREd ... and <slash> *may* have its first
occurrence escaped.
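
As a rough illustration of that alternative definition (not part of any
proposal; the function names are made up for this sketch), a left-to-right
scanner in Python could look like the following. It assumes that a
<backslash> always starts a two-character escape sequence, that a delimiter
inside a bracket expression does not end the RE, and that a delimiter which
is itself '[' ends the RE when it appears unescaped.

```python
def skip_bracket(text, i):
    """Return the index just past the bracket expression starting at text[i]."""
    j = i + 1
    if j < len(text) and text[j] == "^":
        j += 1
    if j < len(text) and text[j] == "]":
        j += 1  # a ']' in first position is a list member, not the end
    while j < len(text):
        if text[j] == "[" and j + 1 < len(text) and text[j + 1] in ".:=":
            # skip [.x.], [:class:] and [=x=] as one unit
            end = text.find(text[j + 1] + "]", j + 2)
            if end == -1:
                raise ValueError("unterminated bracket expression")
            j = end + 2
        elif text[j] == "]":
            return j + 1
        else:
            j += 1
    raise ValueError("unterminated bracket expression")


def split_context_address(text):
    """Split a context address at the start of text into (delimiter, RE, rest)."""
    if text.startswith("\\"):
        # \dREd form: any delimiter except <backslash> and <newline>
        if len(text) < 2 or text[1] in "\\\n":
            raise ValueError("<backslash> and <newline> cannot be delimiters")
        delim, i = text[1], 2
    elif text.startswith("/"):
        # for <slash> the leading <backslash> is optional
        delim, i = "/", 1
    else:
        raise ValueError("not a context address")
    start = i
    while i < len(text):
        ch = text[i]
        if ch == "\\":
            i += 2  # escape sequence: the next character cannot end the RE
        elif ch == delim:
            return delim, text[start:i], text[i + 1:]
        elif ch == "[":
            i = skip_bracket(text, i)
        else:
            i += 1
    raise ValueError("unterminated context address")


print(split_context_address("/RE/"))
print(split_context_address("\\/RE/"))
print(split_context_address("\\xabc\\xdefx"))
```

In this sketch "/RE/" and "\/RE/" give the same result, and the RE part is
returned unmodified (the escape sequences are not resolved).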

But,... I'd guess that any reader, who's a bit familiar with sed, would
still be able to realise from your current proposal what's meant.

So, if you wish, keep it as is... from my side.

> > Oh, and if you should change your proposed text,... could you
> > please always
> > make a new post
> 
> I expect that too much will change to make it reasonable to edit in
> place anyway.

*cough* git-based-workflow *cough* O;-)

> > -------------------------------------------------------------------
> > --- 
> >  (0005777) calestyo (reporter) - 2022-04-02 19:47
> >  https://austingroupbugs.net/view.php?id=1550#c5777 
> > -------------------------------------------------------------------
> > --- 
> > That indented paragraph of yours (in Note 0005775) should (if at
> > all) only
> > go to the Rationale, IMO. At least the part which describes *why*
> > <backslash> and <newline> cannot be used.
> 
> I'll put something similar in rationale.

Regarding that:

> (even in a context address using "\cREc")

Is there anything special about context address vs. the other "place"
where this is relevant (i.e. s-command) which I miss, so that you
specifically mention the context addresses here?

I just ask because *not* mentioning the "obvious" other "place" where
the same thing should apply, could lead people to question whether/why
something may be different there.

> <backslash> does not work, because if it appears unescaped later in
> the RE, it either escapes the following character, which can then
> never be the ending delimiter

It feels like a bit is missing here... namely that it couldn't be decided
whether an unescaped '\' would escape the following character OR be the
delimiter.
The above sentence seems to assume already that such unescaped '\'
wouldn't be the delimiter but rather the RE escape character... but
that is already impossible to decide in the first place (which is why
it cannot be used)?!

And even if it was decided... and '\' was the escape character (and not
the delimiter)... then:
In: "it either escapes the following character, which can then never be
the ending delimiter" ... isn't that anyway never the case?!

If '\' is the escape character (and not the delimiter), then e.g. in
'\.' the '.' would of course never be the delimiter - and if it was the
delimiter then the '.' wouldn't be either!?

So that "which can then never be the ending delimiter" seems a bit
strange or I just don't understand it.

> or it forms part of a bracket expression

"or it *is* part..."?
Just cosmetic ... "form" IMO indicates that it builds something
together with something else (like \ and n could form the escape
sequence for <newline>)

> > -------------------------------------------------------------------
> > --- 
> >  (0005780) calestyo (reporter) - 2022-04-05 00:59
> >  https://austingroupbugs.net/view.php?id=1551#c5780 
> > -------------------------------------------------------------------
> > --- 
> > Still, as I propose in
> > https://austingroupbugs.net/view.php?id=1556#c5778
> > point (c) I'd make this more clear by directly saying, that sed's
> > additions '\n' (for newlines) and '\c' (for escaped delimiter) are
> > -
> > with respect to sed, considered part of the RE respectively
> > replacement
> > language... and that the whole command string (context address
> > respectively s-command) is parsed in one go from left to right.
> 
> We can't specify parsing in one go

Okay my wording was unclear. What I meant with "one go", was that the
whole context address respectively the whole s-command is considered to
be one thing (semantically), i.e. that the escape sequences \c with c
for the delimiter and \n for newline are considered to be part of
the RE (with respect to sed).

IMO, it's like I've tried to describe in issue #1551, note #5780, point
(I):

With the *old* wording (that in the 2.1 draft)
> ...character designated by c appears following a <backslash>...

it was ambiguous as to how the string is to be interpreted.

With new wordings (from your new proposal, but also already with those
from your old one):
- > shall additionally support escaping occurrences of the delimiter
  > within the RE by means of an escape sequence "

- > the delimiter shall not terminate the RE or replacement if it is
  > the second character of an escape sequence

it already became clear, *though indirectly*, because in order to
determine whether or not \c respectively \n is an escape sequence, the
other characters in the string need to be looked at (which wasn't the
case when the wording was just "c preceded by <backslash>").
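
To make the difference concrete, here is a small Python sketch (function
names are hypothetical, and bracket expressions are deliberately ignored to
keep it short). It compares reading "c preceded by <backslash>" literally
with scanning left to right and pairing each unescaped <backslash> with the
character after it.

```python
def naive_escaped(s, c):
    """Indices of c that merely follow a <backslash> (old wording, read literally)."""
    return [i for i in range(1, len(s)) if s[i] == c and s[i - 1] == "\\"]


def scanned_escaped(s, c):
    """Indices of c that are the second character of an escape sequence."""
    found, i = [], 0
    while i < len(s):
        if s[i] == "\\" and i + 1 < len(s):
            if s[i + 1] == c:
                found.append(i + 1)
            i += 2  # consume the whole escape sequence
        else:
            i += 1
    return found


raw = "a\\\\xb"  # the five characters: a \ \ x b
print(naive_escaped(raw, "x"))    # [3]
print(scanned_escaped(raw, "x"))  # []
```

With delimiter 'x', the 'x' in that string follows a <backslash>, but that
<backslash> is itself escaped, so under the left-to-right reading the 'x'
is an unescaped delimiter.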

...

> What matters is that the delimiter can only be escaped with an
> _unescaped_ backslash, and that it doesn't end the RE when it is in a
> bracket expression. I believe my proposal makes both of those things
> clear.

... so, yes, strictly speaking it's already contained in the current
proposal, but IMO it requires some thought to deduce it, and it would be
better for a reader (who is not an author of the standard) if it was
clearly named, which is what I had proposed in some places before
and repeated in this mail's first replying paragraph.

What about e.g. the following:

- Instead of (your addition after line 106070):
  > The BRE and ERE syntax shall additionally support escaping
  > occurrences...

  something like:

  > With sed and context addresses, the BRE and ERE syntax shall be
  > extended to additionally support escaping occurrences...

- Instead of (line 106085):
  > Both BREs and EREs shall also support the following additions:

  something like:

  > With sed, the BRE and ERE syntax shall be extended to
  > additionally support the following:

- Instead of (your change at line 106204):
  > Within the RE and the replacement

  where currently there is no direct mentioning of this being an
  extension (which is what I've been talking about in the beginning of
  this mail), something like:

  > With sed and the s-command, the BRE and ERE syntax shall be
  > extended so that within the RE and the replacement

Anyway just an idea,... your decision.

> I am quite sure that any attempt to require one side or the other to
> change would not achieve consensus, so please drop this.

Well,... okay then for me :-)

> > Geoff's current proposal does not explicitly allow an
> > implementation
> > to choose the behaviour (literal vs. special) for delimiters whose
> > character c would only become special if escaped - but it doesn't
> > explicitly forbid it either (and that's what's IMO missing).
> > 
> > I just saw (at the end of my review) that Geoff's proposal already
> > indicates this whole problematic, in the added paragraph:
> > "Some historical sed implementations..."
> > 
> > Still I think we need to more explicitly rule this out outside of
> > the "RATIONALE" section.
> 
> The proposed normative text clearly forbids it, and the rationale
> points out that it forbids it.  I see no reason to do anything more
> here.

Personally, I still think that this should be said more directly (and
not "hidden" behind the fact that such characters are not "special
characters" but merely "characters that may get 'special meaning'").
It seems all too easy to consider e.g. '(' in BRE as "somewhat" special.

IMO again one of these cases, which may seem absolutely clear when one
has a deep knowledge of all this, but less easy to understand for a
more casual reader.

Even if the BusyBox case is considered just a bug, it shows how easily it
could happen that an implementation might think it could also choose
the behaviour (literal vs. special) for such characters that only get a
special meaning when escaped (like '(' in BRE, or 's' in GNU).

Therefore I'd say it wouldn't harm to emphasise this, e.g. by adding a
short sentence in parentheses to your above proposal (which dealt with
the s-command) like:
> (This includes characters that are not special, but would obtain a
> special meaning if escaped, like '(', '+' or the digits '1' to '9' in
> BREs. See the RATIONALE.)

And the same into the part of the proposal which dealt with the context
addresses (which has a similar sentence).

So, yes, you're right, it's in principle already said (also with your
"according to") but because of the similarity of "special character"
and "character that may get 'special meaning'" I think it would be nice
to have it even more directly mentioned.

Anyway... again your decision, I'll be able to live with it.

> > b) I wouldn't write:
> >     "is not special in a BRE or ERE" <--- this exists in two
> > locations
> > but rather
> >     "is not special in __the__ BRE __respectively__ ERE"
> > or something better.
> > ...

> I think the best way to address this would be to include mention of
> whether -E is specified:
> 
>     ... in a BRE (if the <b>-E</b> option is not specified) or ERE
> (if
>     <b>-E</b> is specified) ...

Looks good to me.

I'd perhaps just write "is [not] in EFFECT" rather than "is [not]
specified",... so that this better covers implementations that offer
alternative spellings to '-E' (like GNU's --regexp-extended or -r)...
or if implementations had a -B option to specifically choose BREs and
would allow multiple of them "-B -E -B", with the last one being taken.

> > I think we could go back and just call that "escape 'c'" or "escape
> > sequence 'c'", though I would personally prefer to retain the
> > parentheses with a hint like "(there can't be escape
> > characters/sequences inside bracket expressions)"
> 
> See my reply above to kre's note 5774.  Rather than mention bracket
> expressions in parentheses, I'd prefer to reference XBD 9.1 (as
> updated
> by bug 1546) for the details of how/where escaping works.

Well... guess that comes down to different philosophies.

Just referring to some chapter may be enough for someone who deals with
the standard on a daily basis,... but it would be nice if it provides
such key information (that one might not think about immediately) in
e.g. parentheses, which simply makes understanding a lot easier, rather
than just referring to another chapter (which may be large and one may
not even know *what's* now the referenced part in that).

> > d) Cosmetics:
> > In some places the wording "escape sequence <backslash>c" is
> > used...
> > but in others e.g. "escape sequence '\n'".
> 
> That's intentional, and is because the "c" in "<backslash>c" is
> italic,
> to indicate that it stands for any character. 

Ah, ok, I see :-)

> > e) Instead of:
> > "The delimiter character that precedes and follows the RE shall not
> > terminate the RE when it appears within a bracket expression. For
> > example, the context address "/[/]/" is equivalent to "/\//"."
> > 
> > "The delimiter character that precedes and follows the RE shall not
> > terminate the RE when it appears within a bracket expression __but
> > be that literal character for the bracket expression__. For
> > example,
> > the context address "/[/]/" is equivalent to "/\//"."
> 
> It might be worth altering this somehow, but "literal" is wrong
> (specifically if the delimiter is '^' or '-', or things like ':' in
> [[:alpha:]]).

Good catch.

I saw your proposal. "Normal" alone wouldn't have been enough for me
(because the subtle difference between "normal" vs. "literal or
special" is not necessarily obvious), but I can live with it quite
well, given how you've updated the example (which makes the meaning of
"normal" clear)!

Apart from that, I still think that the first part of the current
example »"/[/]/" is equivalent to "/\//"« is a bit unfortunate.

The point of the example is trying to show that the / inside a bracket
expression isn't a delimiter, which is however shown right now by
altering the structure of the RE (in doing away with the bracket
expression).

From a teaching PoV, I'd have found these better:
  "\%[%]%" is equivalent to "/[%]/"
or alternatively:
  "/[/]/" is equivalent to "\%[/]%"

Which replace just the delimiter, but leave the RE WITH the bracket
expression AND the literally meant delimiter character the same.

> > => And perhaps something like "should put it inside a bracket
> > expression __with not other characters__" to make clear, that one
> > cannot re-use one e.g. 'sX\X[0-9]XfooX' can NOT be written as
> > 'sX[X0-9]XfooX' but only as 'sX[X][0-9]XfooX'.
> 
> Incorrect, sX[X0-9]XfooX is required to "work". 

Well "to work", sure... but not for the purpose that you describe in
that section (getting a literal 'c', which in my example would have
been an 'X').
If you add that into a bracket expression with other characters, it's
syntactically correct, but you get a different meaning, because it may
be "'X' or any of '0' to '9'".
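
A quick way to see the difference, sketched with Python's re module as a
stand-in (an assumption: for these simple patterns its behaviour coincides
with POSIX BREs, which is not true in general):

```python
import re

# The RE the user actually means, and the two candidate rewrites
# when 'X' is also the delimiter.
patterns = {
    "X[0-9]": re.compile(r"X[0-9]"),       # intended meaning
    "[X][0-9]": re.compile(r"[X][0-9]"),   # 'X' in its own bracket expression
    "[X0-9]": re.compile(r"[X0-9]"),       # 'X' merged into the digit list
}

samples = ["X7", "7", "X"]
results = {
    sample: {name: bool(p.fullmatch(sample)) for name, p in patterns.items()}
    for sample in samples
}

for sample, row in results.items():
    print(sample, row)
```

Only the first two patterns agree on all three samples; the merged form
accepts a lone "7" or "X" and rejects "X7" as a whole-string match.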

I've seen your follow up discussion with Robert, and the current
proposal:
> Applications that use a special RE character as a delimiter (for
> example, '.' or '*') and need to use the delimiter as a literal
> character in the RE should put it inside a bracket expression,

=> here it's IMO still wrong or unclear, because *for the purpose of
   making the delimiter literal in the RE* it must simply be put in its
   own bracket expression... and not inside "a bracket expression" (with
   possibly other characters).

> as
> implementations differ regarding whether escaping it with a
> <backslash> removes its special meaning. For example, the context
> address "\.[.][0-9]." is equivalent to "/\.[0-9]/"; however, with
> "\.\.[0-9]." it is unspecified which of "/\.[0-9]/" or "/.[0-9]/" is
> its equivalent form.

=> IMO it's true what's said, but it does a suboptimal job in
   explaining.
   The paragraph starts with:
   > Applications that use a special RE character as a delimiter (for
   > example, '.' or '*') and need to use the delimiter as a literal

   ... that's what we want to explain, right?

   So I'd swap the first part of the example sentence and say:
   > For example, if the context address "/\.[0-9]/" shall be written
   > with '.' as delimiter the form "\.[.][0-9]." must be used.

   Starting off with the form that uses the bracket expression, while
   that is the desired outcome of the whole exercise, is IMO not ideal.

   It's good, that:
   > however, with "\.\.[0-9]." it is unspecified which of "/\.[0-9]/"
   > or "/.[0-9]/" is its equivalent form.

   is mentioned, but I think the key info is missing, namely:
   that '\.\.[0-9].' wouldn't work because of the 2nd '\.' having
   unspecified meaning (literal vs. special).

   So why not write something like:
   > however, "\.\.[0-9]." couldn't be used portably for this purpose,
   > as it is unspecified whether this would be equivalent to
   > "/\.[0-9]/" or "/.[0-9]/"

   That also adds the hint that this only doesn't work if portability
   is desired.
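
For what it's worth, the two candidate readings are observably different.
A small sketch, again using Python's re module merely as a stand-in for the
two "/"-delimited forms (the variable names are made up):

```python
import re

# The two forms that "\.\.[0-9]." might be equivalent to.
literal_dot = re.compile(r"\.[0-9]")  # reading 1: /\.[0-9]/
any_char = re.compile(r".[0-9]")      # reading 2: /.[0-9]/

lines = [".5", "a5"]
outcome = {
    line: (bool(literal_dot.search(line)), bool(any_char.search(line)))
    for line in lines
}

for line, (as_literal, as_special) in outcome.items():
    print(repr(line), "literal reading:", as_literal, "special reading:", as_special)
```

A line such as "a5" is selected under one reading and not under the other,
which is why the escaped form can't be relied upon portably.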

On Thu, 2022-04-07 at 10:37 +0100, Geoff Clare via austin-group-l at
The Open Group wrote:
> The proposed solution implies that the delimiter
> character can't be put in a bracket expression with other characters,
> even if that's the behaviour the user wants.

No?! The whole paragraph in the RATIONALE was about how to get the
literal meaning (with respect to the RE) of a special character (with
respect to the RE) when that is also used as delimiter (in which case
escaping it won't work (portably), because of this being either literal
OR special).

If that's the situation, then the user doesn't want any other
behaviour.

> > => But anyway,... the above sentence would need to exclude [^]
> > then...
> 
> Red herring. See my comment in bug 1575.

Correct for the matter of [^] by itself,... but again one of these
cases where the standard could with a few words be more explanatory:

> Applications that use a special RE character as a delimiter (for
> example, '.' or '*') and need to use the delimiter as a literal
> character in the RE should put it inside a bracket expression

It could simply say that this won't work for '^', and not leave it up to
the reader to realise that '[^]' isn't valid.

Just as it doesn't leave it up to the reader to realise that
<backslash> and <newline> cannot be used, but states it directly.

> > VI) not really related to this issue, but it would make things even
> > more complex if I add it in a separate ticket:
> > 
> > The description of the y-command contains on page 3138, line
> > 106249:
> > "If the number of characters in string1 and string2 are not equal,
> > or if any of the characters in string1 appear more than once, the
> > results are undefined."
> > 
> > That is strictly speaking wrong, namely in the case when string1
> > and/or string2 contains '\'-escaped 'n' (for newline) or a '\'-
> > escaped
> > delimiters, and the number of occurrences in both strings don't
> > even out.
> > 
> > => Perhaps simply write "If the number of characters (after
> > resolving any escape sequences)..." or so?
> 
> That part of the y command description is not being touched by the
> changes to fix 1550 and 1551, so it would be best to raise a separate
> bug for this.

Done so in https://www.austingroupbugs.net/view.php?id=1578
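
To illustrate what "after resolving any escape sequences" would mean for the
count, here is a minimal Python sketch. It assumes, as I read the current
y-command description, that '\n' stands for <newline>, '\\' for a single
<backslash> and '\c' for the delimiter c, and that anything else after a
<backslash> is undefined (the sketch simply rejects it). The function name
is hypothetical, and the corner case of 'n' as delimiter is left out.

```python
def resolve_y_string(raw, delim):
    """Resolve escape sequences in string1/string2 of a y-command (sketch)."""
    if delim == "n":
        raise ValueError("delimiter 'n' is not handled by this sketch")
    out, i = [], 0
    while i < len(raw):
        if raw[i] != "\\":
            out.append(raw[i])
            i += 1
            continue
        if i + 1 >= len(raw):
            raise ValueError("trailing <backslash>")
        nxt = raw[i + 1]
        if nxt == "n":
            out.append("\n")   # escape sequence for <newline>
        elif nxt == "\\" or nxt == delim:
            out.append(nxt)    # literal <backslash> or literal delimiter
        else:
            raise ValueError("undefined escape sequence")
        i += 2
    return "".join(out)


string1, string2 = "a\\nb", "xyz"   # as written in y/a\nb/xyz/
print(len(string1), len(string2))   # 4 3
print(len(resolve_y_string(string1, "/")), len(resolve_y_string(string2, "/")))  # 3 3
```

So the two strings as written differ in length, but they are equal in length
once the escape sequence has been resolved.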

* Had you considered issue #1551, note #5780, point (II)?
  In the RATIONALE you already explain how one can get the literal 'c'
  in the RE when 'c' is the delimiter.

  Conversely it might make sense to tell that there is no such way to
  get the special meaning (for sure).

* In your example:
  > and the command "s-[0-9]--g" is equivalent to "s/[0-9]//g"

  Why the g flag? I'd rather drop that if it doesn't serve any
  particular purpose.

The following points of issue #1556, note #5778, haven't been dealt
with, yet:

* (e): shware_systems' finding about the somewhat strange sentence on
       page 3137, line 10622 which says:
       > Any <backslash> used to alter the default meaning of a
       > subsequent character shall be discarded from the RE or the
       > replacement before evaluating the RE or using the
       > replacement."

       Is that meant for things like '\('?
       Does it apply only to the replacement?
       The whole sentence is a bit strange, IMO.

* (f): Whether the additionally defined characters that may validly
       follow an escaping '\' (namely \c and \n) should be mentioned in
       the RE character.

* (g): The first part, i.e what '\n' in snAAA\nnXXXn is?
       => newline or the "undelimitered" delimiter character?

       [The second part ('snAAA[\n]nXXXn') got IMO clarified by your
       sentence:
       > The delimiter character that precedes and follows the RE
       > shall not terminate the RE when it appears within a bracket
       > expression…
       AND
       by '\n' as newline escape sequence not being able to be that in
       a bracket expression]

Other than that... the current proposal looks good.

Thanks,
Chris.
