[Issue 8 drafts 0001556]: clarify meaning of \n used in a bracket expression in a sed context address or s-command

Austin Group Bug Tracker via austin-group-l at The Open Group Sat, 02 Apr 2022 19:00:55 -0700


A NOTE has been added to this issue. 
====================================================================== 
https://austingroupbugs.net/view.php?id=1556 
====================================================================== 
Reported By:                calestyo
Assigned To:                
====================================================================== 
Project:                    Issue 8 drafts
Issue ID:                   1556
Category:                   Shell and Utilities
Type:                       Clarification Requested
Severity:                   Objection
Priority:                   normal
Status:                     New
Name:                       Christoph Anton Mitterer 
Organization:                
User Reference:              
Section:                    Utilities, sed / 9.3.5 RE Bracket Expression 
Page Number:                - 
Line Number:                - 
Final Accepted Text:         
====================================================================== 
Date Submitted:             2022-01-18 01:07 UTC
Last Modified:              2022-04-03 01:59 UTC
====================================================================== 
Summary:                    clarify meaning of \n used in a bracket expression
in a sed context address or s-command
======================================================================
Relationships       ID      Summary
----------------------------------------------------------------------
related to          0001550 clarifications/ambiguities in the descr...
======================================================================


---------------------------------------------------------------------- 
 (0005778) calestyo (reporter) - 2022-04-03 01:59
 https://austingroupbugs.net/view.php?id=1556#c5778 
---------------------------------------------------------------------- 
Geoff, I've looked at your proposal at
https://austingroupbugs.net/view.php?id=1550#c5761 and with respect to this
ticket I'd say the following:

The obvious consensus seems to be that, regardless of context address or
s-command, in any RE (BRE or ERE) anything within a bracket expression
(except for the characters special to the bracket expression itself, like
a leading '^' or a '-' that is part of a range) is considered to be literal.
Thus '\' never serves as an escape character, and so there are no escape
sequences within an RE bracket expression.

Right?!




a)

It was said in a previous note that this is already clear, because the
paragraph says "The escape sequence '\n'…", and in an RE bracket
expression this would simply not be an escape sequence, as the '\' is
literal.

In a way that feels temptingly reasonable, especially with the upcoming
changes that Geoff mentioned in
https://austingroupbugs.net/view.php?id=1546#c5768 (with which the RE
chapter now clearly defines what an escape sequence is).

However...

- There are many kinds of escape sequences in POSIX. It's IMO not
absolutely clear from the term itself that the escape sequence '\n' defined
on page 3134 line 106092 in the sed chapter is necessarily of the same
kind as those of the REs.
After all, the escape sequence defined by the sed chapter ('\n' and '\c'
with c being a delimiter) could be interpreted as a completely different
layer of escape sequences, e.g. parsed in a first step, while the RE
resulting from that first pass is only parsed afterwards.

- It *does* in fact get more clear, when one realises the strong weight of
the sentence "Both BREs and EREs shall also support the following
additions:" which supposedly kinda means.. »the following points
(including the definition of \n and \c) are an additional part of the RE
language, with respect to sed«.

So in a way you're right,... strictly speaking it's already there,... but
IMO that really requires the reader to think around several corners:
* first realising: this ('\n' and '\c') is not just some sed-specific
thing, but in the context of sed it's considered to be part of the RE
language
* then realising: in REs, it's (since recently) clear that the terms
"escape sequence/character" always mean those that really act as such...
and don't just describe the string/character.
* then realising: in REs, there's also the rule that <backslash> is literal
within bracket expressions
* then concluding that e.g. '[\n]' is the same as '[n\]', i.e. either 'n'
or <backslash>
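
To make the last step of that chain concrete, here is a minimal Python
sketch (not taken from the standard or from any sed implementation). The
helper name bracket_members is hypothetical, and it models only the
simplest bracket expressions: no ranges, no negation, no character classes.
Under the rule that <backslash> is literal inside brackets, both spellings
yield the same two members.

```python
def bracket_members(expr: str) -> set:
    """Return the literal members of a very simple bracket expression.

    Simplified model (an assumption of this sketch): every character
    between '[' and ']' is taken literally, backslash included.
    """
    if len(expr) < 3 or expr[0] != "[" or expr[-1] != "]":
        raise ValueError("not a simple bracket expression")
    return set(expr[1:-1])


# Raw strings keep the backslash as an ordinary character on the Python side.
print(bracket_members(r"[\n]") == bracket_members(r"[n\]"))
print(sorted(bracket_members(r"[\n]")))
```

An implementation that instead treats '\n' inside brackets as <newline>
would return a single member here, which is exactly the divergence under
discussion.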


IMO, the fact that GNU sed breaks POSIX here so drastically (whereas in
most other cases it would do it much more gracefully or rather just extend
POSIX behaviour) seems already like an indicator, that it's not that
obviously clear, even for experts. (Unless of course, GNU deliberately
chose that way.)

Also, the standard isn't just read by experts who know it more or less by
heart ;-) ... and immediately understand every phrasing and its deeper
meaning.


=> For that reason, I still think that it wouldn't hurt if the definition
in line 106092 were amended by a small hint like "(note that within the
bracket expression of an RE, <backslash> is always taken literally and not
as an escape character, and thus there cannot be any escape sequences
within such a bracket expression)".

(For '\c', as escaped delimiter, you also mention that - in the current
proposal in "Addresses in sed" and in the s-command. You even say
"unescaped <backslash> (except inside a bracket expression)"... so why not
the same for '\n'?)




b)

If it were more obvious that, in the context of sed, the escape sequence
'\n' is not just some sed speciality but an addition to the RE language,
things like:
  "So \n does not match newline when it appears within \\n for example."
would also become more obvious.

That is, it would become clear that sed cannot just e.g. first look for
any '\n' and replace them with <newline>, and only then do the RE parsing
on the result.


=> So perhaps, instead of
   "Both BREs and EREs shall also support the following additions:"
something like:
   "In the context of sed, REs should be considered as if they had been
defined with the following additions (with other rules from REs also
applying to them, too):"
?




c)

I've written in several places, e.g. also in ticket #1551, that it's IMO
not definitely clear how sed parses scripts (context addresses or
s-command) that include an RE:
- it could e.g. first look for any sed specialities, like '\n' (for
newline) or '\c' (for an escaped delimiter), process those and only then
parse the RE (exactly according to the RE rules)
- or it could do it all in one pass from left to right, considering sed
specialities ('\n' and '\c') as an integral part of the RE

I guess it's informally clear, that it's the latter.
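
The difference between the two readings can be sketched in Python. This is
only a model under stated assumptions, not how any real sed is written: the
function names two_pass and one_pass are hypothetical, only '\n' is
handled (not '\c'), and a bracket expression is assumed to end at the first
']' (so a leading ']' and constructs like '[:alpha:]' are ignored).

```python
def two_pass(re_text: str) -> str:
    """Hypothetical first pass: blindly turn every backslash-n into a newline."""
    return re_text.replace("\\n", "\n")


def one_pass(re_text: str) -> str:
    """Left-to-right scan that treats backslash-n as part of the RE language."""
    out = []
    i = 0
    in_bracket = False
    while i < len(re_text):
        ch = re_text[i]
        if in_bracket:
            # Inside brackets everything is literal, backslash included.
            out.append(ch)
            if ch == "]":
                in_bracket = False
            i += 1
        elif ch == "[":
            in_bracket = True
            out.append(ch)
            i += 1
        elif ch == "\\" and i + 1 < len(re_text):
            nxt = re_text[i + 1]
            # An unescaped backslash consumes the next character as a pair.
            out.append("\n" if nxt == "n" else ch + nxt)
            i += 2
        else:
            out.append(ch)
            i += 1
    return "".join(out)


for text in (r"a\nb", r"\\n", r"[\n]"):
    print(repr(text), repr(two_pass(text)), repr(one_pass(text)))
```

In this model both strategies agree on a plain '\n', but they diverge on
'\\n' and on '[\n]': the blind first pass inserts a <newline> in both,
while the single pass leaves them untouched.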

The proposed change in (b) would IMO already make this quite clear, at
least if the parts that describe these extensions all move back to "Regular
Expressions in sed". I mean the parts which describe how the escape
sequence '\c' (with c the delimiter) is processed, in the current
proposal:
- in "Addresses in sed"
- and in the s-command 


=> Perhaps one should additionally add something like:
"The delimiters, RE and any extensions to regular expressions from sed
(like '\n' and '\c') are processed in one pass."




d)

If:
- the upcoming changes that Geoff mentioned in
https://austingroupbugs.net/view.php?id=1546#c5768
- what I proposed above in (a), (b), (c)
were adopted,...

... then I'd agree with what KRE said above in note #5623 and my objections
from note #5624 would become moot, except for one point perhaps:

In the current proposal, there is still:
'In a context address, any character other than <backslash> or <newline>
can be specified for use as the delimiter by means of the construction
"\cREc", where c is the chosen delimiter character.'

For that very first \c (which isn't an escaped delimiter in the sense of
"the RE should see it as literal") it's never mentioned that this is an
escape sequence, but it's rather called "construction".

Only if one again thinks around some corners does one realise:
- '\\cREc' wouldn't be something strange like a literal '\' followed by a
delimiter c of a context address... it would be an attempt to make '\' the
delimiter, and 'c' would actually be part of the RE.
- However, having '\' as delimiter is forbidden. Thus there's no strict
need to mention that the '\c' must be an escape sequence, respectively
mustn't be preceded by an unescaped <backslash>.


I still wonder whether we should hint that somehow, like a »('\c' is
always an escape sequence as its escape character cannot be validly
escaped)«.
But I wouldn't insist on that (not that my opinion would matter much anyway
;-) ).




e)

In note #5625 above, shware_systems mentioned page 3137, line 10622,
which says:
"Any <backslash> used to alter the default meaning of a subsequent
character shall be discarded from the RE or the replacement before
evaluating the RE or using the replacement."

=> I would start a new paragraph before that. It doesn't seem to be related
to the other sentence ("A substitution shall be considered to have been
performed") of that paragraph.

Second, what does it mean?

- Any <backslash> used to alter the default meaning of the following
character...
  That would include:
  - '\n'
  - '\c' (with c being the delimiter)
  - with the changes proposed in
https://austingroupbugs.net/view.php?id=1546, depending on the
implementation '\+'
  - in principle even the special chars from BRE/ERE and those that become
special when preceded by backslash
- And it's discarded *before* the RE is evaluated, respectively the
replacement is used.

As shware_systems says... that could e.g. be interpreted to mean that
[\n] would be a bracket expression with <newline>.

=> Doesn't that sentence need to be clarified?




f)

Page 167, line 5802 says:
"When not inside a bracket expression, the interpretation of an ordinary
character preceded by an
unescaped <backslash> is undefined, except for..."

and page 173, line 6044 says:
"When not inside a bracket expression, the interpretation of an ordinary
character preceded by an unescaped <backslash> is undefined, except
for…"

The additions by sed ('\c' and '\n') obviously touch this.

Shall we do anything about that?
Like e.g. adding a note to the RE chapter, that for REs in the context of
sed, there are more definitions?

I wouldn't strongly insist on that (not that my opinion would count much
anyway ^^)... but since implementations do extend the RE language (like
GNU's '\s', etc.) it might be reasonable to note that, in the context of
sed, the escape sequence '\n' is already used and '\c' with c being the
delimiter is also special.




g)

The last unclear point, with respect to this ticket, is IMO:

What the following means:
 snAAA\nnXXXn

Is '\n' a:
- escaped delimiter
  and thus the command equivalent to: s/AAAn/XXX/
or a:
- newline
  and thus the command equivalent to: s/AAA<newline>/XXX/
??


The following is tricky:
 snAAA[\n]nXXXn

It would IMO already be clear, if:
- the upcoming changes that Geoff mentioned in
https://austingroupbugs.net/view.php?id=1546#c5768
- what I proposed above in (a), (b), (c)
were adopted.
I.e. '\n' (for newline) and '\c' (which is here also '\n') would need to be
considered part of the RE language (in the context of sed).

Only then would it be fully clear that the s-command above is equivalent
to: 's/AAA[n/]/XXX/' which is the same as 's%AAA[n/]%XXX%'




h)

Unrelated to this ticket, but since it just fits in:
What in the standard makes it clear that in 's/AAA[n/]/XXX/' the 2nd '/'
(the one in the bracket expression) is not a delimiter?

It's the non-escaped delimiter, which (unlike '\c') is not defined to be
part of the RE language... so without what I propose in (c) above (which
also includes non-escaped delimiters) one could still think of a
hypothetical 2-pass parsing that gives:
         RE: AAA[n
replacement: ]
      flags: XXX/

The RE rules on page 168, line 5840 merely say that '\' and the BRE
respectively ERE special characters lose their special meanings (in
bracket expressions of BREs respectively EREs). '/' however isn't special
to any of them... so one could argue that it retains its special meaning
(as delimiter).
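
The two readings of 's/AAA[n/]/XXX/' can be modelled with a short Python
sketch. Both function names are hypothetical, and the bracket-aware variant
encodes just one possible interpretation (a delimiter inside a bracket
expression does not terminate the RE), not wording from the standard. It
also assumes a bracket expression ends at the first ']' and keeps
backslash pairs together without deciding what they mean, so it says
nothing about the 'snAAA\nnXXXn' question from (g).

```python
def split_naive(cmd: str) -> list:
    """Hypothetical 2-pass reading: cut at every delimiter, ignoring RE syntax."""
    delim = cmd[1]
    return cmd[2:].split(delim, 2)


def split_bracket_aware(cmd: str) -> list:
    """Single left-to-right scan; a delimiter inside [...] in the RE is literal."""
    delim = cmd[1]
    fields = []
    cur = []
    in_bracket = False
    i = 2
    while i < len(cmd):
        ch = cmd[i]
        in_re = len(fields) == 0  # still collecting the RE field
        if in_re and in_bracket:
            cur.append(ch)
            if ch == "]":
                in_bracket = False
        elif in_re and ch == "[":
            in_bracket = True
            cur.append(ch)
        elif ch == "\\" and i + 1 < len(cmd):
            # Keep the backslash pair together so an escaped delimiter
            # does not end the field; its meaning is left undecided here.
            cur.append(ch + cmd[i + 1])
            i += 1
        elif ch == delim and len(fields) < 2:
            fields.append("".join(cur))
            cur = []
        else:
            cur.append(ch)
        i += 1
    fields.append("".join(cur))  # whatever remains is the flags field
    return fields


print(split_naive("s/AAA[n/]/XXX/"))
print(split_bracket_aware("s/AAA[n/]/XXX/"))
```

The naive split reproduces the RE / replacement / flags breakdown shown
above, while the bracket-aware scan yields the RE 'AAA[n/]', the
replacement 'XXX' and empty flags.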

Issue History 
Date Modified    Username       Field                    Change               
====================================================================== 
2022-01-18 01:07 calestyo       New Issue                                    
2022-01-18 01:07 calestyo       Name                      => Christoph Anton
Mitterer
2022-01-18 01:07 calestyo       Section                   => Utilities, sed /
9.3.5 RE Bracket Expression
2022-01-18 01:07 calestyo       Page Number               => -               
2022-01-18 01:07 calestyo       Line Number               => -               
2022-01-18 09:41 geoffclare     Note Added: 0005621                          
2022-01-18 13:26 calestyo       Note Added: 0005622                          
2022-01-18 16:30 kre            Note Added: 0005623                          
2022-01-18 17:12 calestyo       Note Added: 0005624                          
2022-01-18 18:41 shware_systems Note Added: 0005625                          
2022-01-18 21:17 calestyo       Note Added: 0005626                          
2022-01-18 21:19 calestyo       Note Edited: 0005626                         
2022-03-25 16:37 geoffclare     Note Added: 0005762                          
2022-03-31 16:00 nick           Relationship added       related to 0001550  
2022-04-02 02:38 calestyo       Note Edited: 0005622                         
2022-04-02 22:00 calestyo       Note Edited: 0005626                         
2022-04-03 01:59 calestyo       Note Added: 0005778                          
======================================================================
