Re: [Issue 8 drafts 0001556]: clarify meaning of \n used in a bracket expression in a sed context address or s-command

Eric Blake via austin-group-l at The Open Group Mon, 25 Apr 2022 09:07:07 -0700

Adding bug-...@gnu.org into this conversation.

On Mon, Apr 25, 2022 at 02:50:22AM +0200, Christoph Anton Mitterer via 
austin-group-l at The Open Group wrote:
> Hey.
> 
> Geoff, I haven't had time yet to look at your updated proposal for
> #1550, and I'm not sure whether I'll manage to do it tonight or in the
> next few days.
> But I'll definitely reply, so please be a bit more patient. :-)
> 
> 
> However, one thing came to my mind again, which I think needs further
> discussion...
> 
> 
> 
> The current "solution" to a number of previous problems is:
> 
> Inside a bracket expression there cannot be any escape sequences.
> Therefore, there cannot be any \n (in the sense of <newline>) nor any
> \c (in the sense of "un-delimitering" the delimiter character c).
> 
> 
> While this is per se perfectly valid (and solves numerous issues), it
> has one problem:
> 
> (at least) GNU sed breaks it already!
> 
> 
> 
> As you noted yourself in
> https://www.austingroupbugs.net/view.php?id=1556#c5621
> 
> it requires POSIXLY_CORRECT=1 to work as it should.
> 
> $ printf 'a\\b\n' | sed 's/a[\n]b/X/'
> a\b
> $ printf 'a\nb\n' | sed 's/a[\n]b/X/'
> a
> b
> $ printf 'a\nb\n' | sed -z 's/a[\n]b/X/'
> X
> $ printf 'anb\n' | sed 's/a[\n]b/X/'
> anb
> $ export POSIXLY_CORRECT=1
> $ printf 'a\\b\n' | sed 's/a[\n]b/X/'
> X
> $ printf 'a\nb\n' | sed 's/a[\n]b/X/'
> a
> b
> $ printf 'a\nb\n' | sed -z 's/a[\n]b/X/'
> a
> b
> $ printf 'anb\n' | sed 's/a[\n]b/X/'
> X
> $ 
> 
> 
> NOT so for GNU's extension of '\s':
> '\s'
>      Matches whitespace characters (spaces and tabs).  Newlines
>      embedded in the pattern/hold spaces will also match...
> (and, I assume, not for any similar extensions either):
> 
> $ printf 'asb\n' | sed 's/a[\s]b/X/'
> X
> $ printf 'a\\b\n' | sed 's/a[\s]b/X/'
> X
> $ printf 'a b\n' | sed 's/a[\s]b/X/'
> a b
> $ export POSIXLY_CORRECT=1
> $ printf 'asb\n' | sed 's/a[\s]b/X/'
> X
> $ printf 'a\\b\n' | sed 's/a[\s]b/X/'
> X
> $ printf 'a b\n' | sed 's/a[\s]b/X/'
> a b
> $
> 
> 
> It also works as expected for escaped delimiter characters:
> $ printf 'aDb\n' | sed 'sDa[\D]bDXD'
> X
> $ printf 'a\\b\n' | sed 'sDa[\D]bDXD'
> X
> 
> even when the delimiter char has also special meaning when escaped (as
> with '\s'):
> $ printf 'asb\n' | sed 'ssa[\s]bsXs'
> X
> $ printf 'a\\b\n' | sed 'ssa[\s]bsXs'
> X
> $ printf 'a b\n' | sed 'ssa[\s]bsXs'
> a b
> 
> 
> (all the above with GNU sed 4.8).
> 
> 
> So the only problematic case seems to be '\n'.
> 
> 
> 
> I don't want to step on anyone's toes... but GNU sed is probably one of
> the (if not the) major implementations of sed, isn't it?
> 
> 
> And regardless of POSIXLY_CORRECT, the standard now describes a
> behaviour (namely that the bracket expression [\n] is the literal
> characters '\' or 'n' and *not* <newline>)... which is not shared by a
> major implementation, at least not with its default settings.
> 
> Anyone who reads the standard would assume that [\n] is not a
> <newline>. 
> And of course we could just say "well, your implementation is not
> compliant" or "look at its documentation, where it talks about
> POSIXLY_CORRECT" ... but that doesn't seem so good to me.
> 
> Usually, implementations extend POSIX rather gracefully, but this is a
> more serious deviation.
> 
> 
> I mean should we just leave it at that?
> 
> Or should we add some hint, e.g. indicating that portable applications
> should not use '\n' but rather 'n\' ... or perhaps even generally place
> '\' last in the bracket expression?
> 
> 
> The best would of course be to get GNU to change its behaviour, though I
> have no idea how likely that is ;-)
> 
> I had tried to reach out to the GNU and BusyBox sed maintainers before,
> and while I got replies from BusyBox's, I couldn't get in touch with GNU's.
> 
> Is there anyone who's in contact with these people?


The GNU sed developers can be reached at bug-...@gnu.org (per the
output of 'sed --help', and as done in this email).

So if I'm restating your complaint correctly, you are worried that GNU
sed's non-POSIX behavior (what you get by default when POSIXLY_CORRECT
is not set) treats the four-byte sequence '[\n]' in an s-command regex
as a bracket expression for the single character of a literal newline
(that is, interpreting \n as an escape sequence even though it is
inside a bracket expression), instead of as a bracket expression for
either of a literal backslash or literal n; but concur that its
behavior when being POSIX-compliant matches the POSIX rules.
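
One way to see which of the two interpretations a given sed uses is a
small probe built on the 'anb' test quoted above.  This is only a
sketch: the function name probe_bracket_newline is made up for
illustration, and it only distinguishes "[\n] matches a literal n" from
"anything else" (which, for GNU sed without POSIXLY_CORRECT, would be
the escape-sequence reading).

```shell
# Hypothetical helper: report how the sed found in PATH reads [\n].
probe_bracket_newline() {
    # Under the POSIX rules, [\n] is "backslash or n", so 'anb' becomes 'X'.
    if [ "$(printf 'anb\n' | sed 's/a[\n]b/X/')" = "X" ]; then
        echo "literal"   # bracket expression matched the letter n
    else
        echo "other"     # e.g. \n was taken as an escape for <newline>
    fi
}

probe_bracket_newline
```

Running it once with and once without POSIXLY_CORRECT=1 in the
environment should show whether the setting changes the result.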

POSIX can't control what GNU sed does when in non-POSIX mode.  But it
can document a recommendation to spell the bracket expression intended
to match either a backslash or an n in the order [n\] to avoid any
potential confusion with [\n] being interpreted as an escape sequence.
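
A minimal sketch of that spelling, assuming a POSIX shell and whatever
sed is first in PATH; the helper name match_backslash_or_n is
hypothetical.  With the backslash placed last, the expression never
contains the two-byte sequence \n, so there is nothing that could be
read as an escape:

```shell
# Hypothetical helper: replace "a", then a backslash or an n, then "b".
match_backslash_or_n() {
    # [n\] lists n first and the backslash last, avoiding the bytes \n.
    printf '%s\n' "$1" | sed 's/a[n\]b/X/'
}

match_backslash_or_n 'a\b'   # input contains a literal backslash
match_backslash_or_n 'anb'   # input contains a literal n
match_backslash_or_n 'a b'   # neither, so no substitution is expected
```

On an implementation that treats the backslash as literal inside a
bracket expression, this should print X for the first two inputs and
leave 'a b' unchanged.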

Or am I missing something else that you are proposing, either something
the Austin Group should do in its documentation efforts, or something GNU
sed should do to comply with the recent Austin Group recommendations?

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3266
Virtualization:  qemu.org | libvirt.org
