A NOTE has been added to this issue.
======================================================================
https://austingroupbugs.net/view.php?id=1556
======================================================================
Reported By:                calestyo
Assigned To:
======================================================================
Project:                    Issue 8 drafts
Issue ID:                   1556
Category:                   Shell and Utilities
Type:                       Clarification Requested
Severity:                   Objection
Priority:                   normal
Status:                     New
Name:                       Christoph Anton Mitterer
Organization:
User Reference:
Section:                    Utilities, sed / 9.3.5 RE Bracket Expression
Page Number:                -
Line Number:                -
Final Accepted Text:
======================================================================
Date Submitted:             2022-01-18 01:07 UTC
Last Modified:              2022-04-03 01:59 UTC
======================================================================
Summary:                    clarify meaning of \n used in a bracket expression in a sed context address or s-command
======================================================================
Relationships       ID      Summary
----------------------------------------------------------------------
related to          0001550 clarifications/ambiguities in the descr...
======================================================================
----------------------------------------------------------------------
(0005778) calestyo (reporter) - 2022-04-03 01:59
https://austingroupbugs.net/view.php?id=1556#c5778
----------------------------------------------------------------------
Geoff, I've looked at your proposal at https://austingroupbugs.net/view.php?id=1550#c5761 and with respect to this ticket I'd say the following:

The obvious consensus seems to be that, regardless of context addresses or s-command, in any RE (BRE or ERE) anything within a bracket expression (except for the characters special to the bracket expression itself, like a leading '^' or a '-' that is part of a range) is considered to be literal.
Thus '\' never serves as an escape character, and thus there are no escape sequences within a RE bracket expression. Right?!

a) It was said in a previous note that this would already have been clear, because the paragraph says "The escape sequence '\n'…", and in a RE bracket expression this would simply not be an escape sequence, as the '\' is literal.

In a way that feels temptingly reasonable, especially with the upcoming changes that Geoff mentioned in https://austingroupbugs.net/view.php?id=1546#c5768 (with which the RE chapter now clearly defines what an escape sequence is).

However...
- There are many kinds of escape sequences in POSIX. It's IMO not absolutely clear from the term itself that the escape sequence '\n' defined on page 3134 line 106092 in the sed chapter is necessarily of the same kind as those of the REs.
  After all, the escape sequences defined by the sed chapter ('\n' and '\c' with c being a delimiter) could be interpreted as a completely different layer of escape sequences, e.g. parsed in a first step, while the RE resulting from that first pass is only parsed afterwards.
- It *does* in fact get clearer when one realises the strong weight of the sentence "Both BREs and EREs shall also support the following additions:", which supposedly kinda means:
»the following points (including the definition of \n and \c) are an additional part of the RE language, with respect to sed«.

So in a way you're right... strictly speaking it's already there... but IMO that really requires thinking around several corners:
* first realising: this (\n and \c) is not just some sed-specific thing, but in the context of sed it's considered to be part of the RE language
* then realising: in REs, it's (since recently) clear that the terms "escape sequence/character" always mean those that really act as such, and don't just denote the string/character
* then realising: in REs, there's also the rule that <backslash> is literal within bracket expressions
* then concluding that e.g. '[\n]' is the same as '[n\]', i.e. either 'n' or <backslash>

IMO, the fact that GNU sed breaks POSIX here so drastically (whereas in most other cases it would do so much more gracefully, or rather just extend POSIX behaviour) already seems like an indicator that it's not that obviously clear, even for experts. (Unless of course GNU deliberately chose that way.)
Also, the standard isn't just read by experts who know it more or less by heart ;-) ... and immediately understand every phrasing and its deeper meaning.

=> For that reason, I still think that it wouldn't harm if the definition in line 106092 were amended by a small hint like "(note that within the bracket expression of a RE, <backslash> is always taken literally and not as an escape character, and thus there cannot be any escape sequences within such a bracket expression)".
(For '\c', as escaped delimiter, you also mention that in the current proposal, in "Addresses in sed" and in the s-command. You even say "unescaped <backslash> (except inside a bracket expression)"... so why not the same for '\n'?)
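To illustrate the last step of that chain, here is a rough Python sketch (my own simplified model, not anything taken from the standard or from a real sed). The helper name `bracket_members` is hypothetical, and it deliberately ignores ranges, character classes, equivalence classes and collating symbols. It only shows that, if <backslash> is literal inside a bracket expression, '[\n]' and '[n\]' describe the same set.

```python
def bracket_members(expr):
    """Return (negated, members) for a simple bracket expression.

    Assumption: a simplified model that handles only plain members and a
    leading '^'. Ranges, [:class:], [=x=] and [.x.] are not supported.
    """
    if len(expr) < 3 or expr[0] != "[" or expr[-1] != "]":
        raise ValueError("not a bracket expression")
    body = expr[1:-1]
    negated = body.startswith("^")
    if negated:
        body = body[1:]
    # Every remaining character, including <backslash>, is taken literally,
    # so no escape sequence is ever recognised here.
    return negated, set(body)


# '[\n]' and '[n\]' both contain exactly <backslash> and 'n'.
print(bracket_members(r"[\n]") == bracket_members(r"[n\]"))  # True
```

Under this reading a <newline> can never be a member of '[\n]', which is exactly where an implementation that treats '\n' as an escape inside brackets would differ.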
b) If it were more obviously clear that, in the context of sed, the escape sequence '\n' is not just some sed speciality but an addition to the RE language, then things like:
"So \n does not match newline when it appears within \\n for example."
would also become more obvious.
That is, it would become clear that sed cannot just e.g. first look for any '\n', replace them with <newline>, and only then do the RE parsing on the result.

=> So perhaps, instead of "Both BREs and EREs shall also support the following additions:", something like:
"In the context of sed, REs should be considered as if they had been defined with the following additions (with the other rules for REs applying to them, too):"
?

c) I've written in several places, e.g. also in ticket #1551, that it's IMO not definitely clear how sed parses scripts (context addresses or s-command) that include a RE:
- it could e.g. first look for any sed specialities, like '\n' (for newline) or '\c' (for an escaped delimiter), process those, and only then parse the RE (exactly according to the RE rules)
- or it could do everything in one pass from left to right, considering the sed specialities ('\n' and '\c') as an integral part of the RE

I guess it's informally clear that it's the latter.

The proposed change in (b) would IMO already make this quite clear, at least if the parts that describe these extensions all move back to "Regular Expressions in sed".
I mean the parts which describe how the escape sequence '\c' (with c the delimiter) is processed, in the current proposal:
- in "Addresses in sed"
- and in the s-command

=> Perhaps one should additionally add something like:
"The delimiters, RE and any extensions to regular expressions from sed (like '\n' and '\c') are processed in one pass."

d) If:
- the upcoming changes that Geoff mentioned in https://austingroupbugs.net/view.php?id=1546#c5768
- what I proposed above in (a), (b), (c)
were adopted,...
then I'd agree with what KRE said above in note #5623, and my objections from note #5624 would become moot, except perhaps for one point:

In the current proposal, there is still:
'In a context address, any character other than <backslash> or <newline> can be specified for use as the delimiter by means of the construction "\cREc", where c is the chosen delimiter character.'

For that very first \c (which isn't an escaped delimiter in the sense of "the RE should see it as literal") it's never mentioned that this is an escape sequence; it's rather called a "construction".
Only if one again thinks around some corners does one realise:
- '\\cREc' wouldn't be something strange like a literal \ followed by a delimiter c of a context address... it would be an attempt to make '\' the delimiter, and 'c' would actually be part of the RE.
- However, having '\' as delimiter is forbidden.
Thus there's no strict need to mention that the '\c' must be an escape sequence, i.e. mustn't be preceded by an unescaped <backslash>.

I still wonder whether we should hint at that somehow, like a »('\c' is always an escape sequence as its escape character cannot be validly escaped)«. But I wouldn't insist on that (not that my opinion would matter much anyway ;-) ).

e) shware_systems mentioned in note #5625 above the page 3137, line 10622, which says:
"Any <backslash> used to alter the default meaning of a subsequent character shall be discarded from the RE or the replacement before evaluating the RE or using the replacement."

=> I would start a new paragraph before that. It doesn't seem to be related to the other sentence ("A substitution shall be considered to have been performed") of that paragraph.

Second, what does it mean?
- Any <backslash> used to alter the default meaning of the following character...
  That would include:
  - '\n'
  - '\c' (with c being the delimiter)
  - with the changes proposed in https://austingroupbugs.net/view.php?id=1546, depending on the implementation, '\+'
  - in principle even the special chars from BRE/ERE and those that become special when preceded by a backslash
- And it's discarded *before* the RE is evaluated or the replacement is used.

As shware_systems says, that could e.g. be interpreted such that [\n] would be a bracket expression containing <newline>.

=> Doesn't that sentence need to be clarified?

f) Page 167, line 5802 says:
"When not inside a bracket expression, the interpretation of an ordinary character preceded by an unescaped <backslash> is undefined, except for..."
and page 173, line 6044 says:
"When not inside a bracket expression, the interpretation of an ordinary character preceded by an unescaped <backslash> is undefined, except for…"

The additions by sed ('\c' and '\n') obviously touch this.
Shall we do anything about that? E.g. add a note to the RE chapter that for REs in the context of sed there are more definitions?
I wouldn't strongly insist on that (not that my opinion would count much anyway ^^)... but since implementations do extend the RE language (like GNU's '\s', etc.) it might be reasonable to note that, in the context of sed, the escape sequence '\n' is already used and '\c' with c being the delimiter is also special.

g) The last unclear point with respect to this ticket is IMO what the following means:
snAAA\nnXXXn
Is '\n':
- an escaped delimiter, and thus the command equivalent to:
  s/AAAn/XXX/
or:
- a newline, and thus the command equivalent to:
  s/AAA<newline>/XXX/
??

The following is tricky:
snAAA[\n]nXXXn

It would IMO already be clear if:
- the upcoming changes that Geoff mentioned in https://austingroupbugs.net/view.php?id=1546#c5768
- what I proposed above in (a), (b), (c)
were adopted. I.e.
'\n' (for newline) and '\c' (which is here also '\n') would need to be considered part of the RE language (in the context of sed).
Only then would it be fully clear that the s-command above is equivalent to:
's/AAA[n/]/XXX/'
which is the same as
's%AAA[n/]%XXX%'

h) Unrelated to this ticket, but since it just fits in:
What in the standard makes it clear that in 's/AAA[n/]/XXX/' the 2nd '/' (the one in the bracket expression) is not a delimiter?
It's the non-escaped delimiter, which (unlike \c) is not defined to be part of the RE language... so without what I propose in (c) above (which also includes non-escaped delimiters) one could still think of a hypothetical 2-pass parsing that gives:
RE: AAA[n
replacement: ]
flags: XXX/

The RE rules on page 168, line 5840 merely say that '\' and the BRE or ERE special characters lose their special meanings (in bracket expressions of BREs or EREs, respectively). '/' however isn't special to any of them... so one could argue that it retains its special meaning (as delimiter).
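
The two readings in (h) can be sketched in Python. This is only an illustrative model under my own assumptions, not how any real sed is implemented: the function names `split_naive` and `split_bracket_aware` are hypothetical, and the bracket handling ignores [:class:], [=x=] and [.x.]. It compares a splitter that cuts at every unescaped delimiter with one that treats a bracket expression in the RE as an opaque unit.

```python
def split_naive(cmd):
    """Split an s-command at every unescaped delimiter (brackets ignored)."""
    delim = cmd[1]
    parts, cur, i = [], "", 2
    while i < len(cmd):
        ch = cmd[i]
        if ch == "\\" and i + 1 < len(cmd):
            cur += cmd[i:i + 2]          # keep escaped pair together
            i += 2
            continue
        if ch == delim and len(parts) < 2:
            parts.append(cur)            # end of RE or of replacement
            cur = ""
        else:
            cur += ch
        i += 1
    parts.append(cur)                    # whatever is left is the flags
    return parts


def split_bracket_aware(cmd):
    """Same, but a bracket expression inside the RE is copied verbatim."""
    delim = cmd[1]
    parts, cur, i = [], "", 2
    while i < len(cmd):
        ch = cmd[i]
        if not parts and ch == "[":
            j = i + 1
            if j < len(cmd) and cmd[j] == "^":
                j += 1
            if j < len(cmd) and cmd[j] == "]":
                j += 1                   # a leading ']' is a literal member
            end = cmd.find("]", j)
            if end == -1:
                raise ValueError("unterminated bracket expression")
            cur += cmd[i:end + 1]        # delimiters in here are literal
            i = end + 1
            continue
        if ch == "\\" and i + 1 < len(cmd):
            cur += cmd[i:i + 2]
            i += 2
            continue
        if ch == delim and len(parts) < 2:
            parts.append(cur)
            cur = ""
        else:
            cur += ch
        i += 1
    parts.append(cur)
    return parts


print(split_naive("s/AAA[n/]/XXX/"))          # ['AAA[n', ']', 'XXX/']
print(split_bracket_aware("s/AAA[n/]/XXX/"))  # ['AAA[n/]', 'XXX', '']
```

The first result is the hypothetical 2-pass outcome described above (RE 'AAA[n', replacement ']', flags 'XXX/'). The second is what one gets when delimiter detection and bracket expression parsing happen in the same left-to-right pass.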
Issue History
Date Modified    Username       Field                    Change
======================================================================
2022-01-18 01:07 calestyo       New Issue
2022-01-18 01:07 calestyo       Name                      => Christoph Anton Mitterer
2022-01-18 01:07 calestyo       Section                   => Utilities, sed / 9.3.5 RE Bracket Expression
2022-01-18 01:07 calestyo       Page Number               => -
2022-01-18 01:07 calestyo       Line Number               => -
2022-01-18 09:41 geoffclare     Note Added: 0005621
2022-01-18 13:26 calestyo       Note Added: 0005622
2022-01-18 16:30 kre            Note Added: 0005623
2022-01-18 17:12 calestyo       Note Added: 0005624
2022-01-18 18:41 shware_systems Note Added: 0005625
2022-01-18 21:17 calestyo       Note Added: 0005626
2022-01-18 21:19 calestyo       Note Edited: 0005626
2022-03-25 16:37 geoffclare     Note Added: 0005762
2022-03-31 16:00 nick           Relationship added       related to 0001550
2022-04-02 02:38 calestyo       Note Edited: 0005622
2022-04-02 22:00 calestyo       Note Edited: 0005626
2022-04-03 01:59 calestyo       Note Added: 0005778
======================================================================
