Adding bug-...@gnu.org into this conversation.

On Mon, Apr 25, 2022 at 02:50:22AM +0200, Christoph Anton Mitterer via austin-group-l at The Open Group wrote:
> Hey.
>
> Geoff, I haven't had time yet to look at your updated proposal of
> #1550, not sure whether I manage to do it this night or in the next
> days.
> But I'll definitely reply, so please be a bit more patient. :-)
>
> However, one thing came to my mind again, which I think needs further
> discussion...
>
> The current "solution" to a number of previous problems is:
>
> Inside a bracket expression there cannot be any escape sequences.
> Therefore, there cannot be any \n (in the sense of <newline>) nor any
> \c (in the sense of "un-delimitering" the delimiter character c).
>
> While this is per se perfectly valid (and solves numerous issues), it
> has one problem:
>
> (at least) GNU sed breaks it already!
>
> As you noted yourself in
> https://www.austingroupbugs.net/view.php?id=1556#c5621
>
> it requires POSIXLY_CORRECT=1 to work as it should.
>
> $ printf 'a\\b\n' | sed 's/a[\n]b/X/'
> a\b
> $ printf 'a\nb\n' | sed 's/a[\n]b/X/'
> a
> b
> $ printf 'a\nb\n' | sed -z 's/a[\n]b/X/'
> X
> $ printf 'anb\n' | sed 's/a[\n]b/X/'
> anb
> $ export POSIXLY_CORRECT=1
> $ printf 'a\\b\n' | sed 's/a[\n]b/X/'
> X
> $ printf 'a\nb\n' | sed 's/a[\n]b/X/'
> a
> b
> $ printf 'a\nb\n' | sed -z 's/a[\n]b/X/'
> a
> b
> $ printf 'anb\n' | sed 's/a[\n]b/X/'
> X
> $
>
> NOT so for GNU's extension of '\s':
> '\s'
>      Matches whitespace characters (spaces and tabs). Newlines
>      embedded in the pattern/hold spaces will also match...
> (and I assume neither for any similar such extensions):
>
> $ printf 'asb\n' | sed 's/a[\s]b/X/'
> X
> $ printf 'a\\b\n' | sed 's/a[\s]b/X/'
> X
> $ printf 'a b\n' | sed 's/a[\s]b/X/'
> a b
> $ export POSIXLY_CORRECT=1
> $ printf 'asb\n' | sed 's/a[\s]b/X/'
> X
> $ printf 'a\\b\n' | sed 's/a[\s]b/X/'
> X
> $ printf 'a b\n' | sed 's/a[\s]b/X/'
> a b
> $
>
> It also works as expected for escaped delimiter characters:
> $ printf 'aDb\n' | sed 'sDa[\D]bDXD'
> X
> $ printf 'a\\b\n' | sed 'sDa[\D]bDXD'
> X
>
> even when the delimiter char has also special meaning when escaped (as
> with '\s'):
> $ printf 'asb\n' | sed 'ssa[\s]bsXs'
> X
> $ printf 'a\\b\n' | sed 'ssa[\s]bsXs'
> X
> $ printf 'a b\n' | sed 'ssa[\s]bsXs'
> a b
>
> (all the above with GNU sed 4.8).
>
> So the only problematic case seems to be '\n'.
>
> I don't want to step on anyone's toes... but GNU sed is probably one of
> the (if not the) major implementations of sed, isn't it?
>
> And regardless of POSIXLY_CORRECT, the standard describes now a
> behaviour (namely that the bracket expression [\n] is the literal
> characters '\' or 'n' and *not* <newline>)... which is not shared by a
> major implementation, at least not with its default settings.
>
> Anyone who reads the standard would assume that [\n] is not a
> <newline>.
> And of course we could just say "well your implementation is not
> compliant" or "look at its documentation, where it says about
> POSIXLY_CORRECT" ... but that doesn't seem so good to me.
>
> Usually, implementations extend POSIX rather gracefully, but this is a
> more serious deviation.
>
> I mean should we just leave it at that?
>
> Or should we add some hint, e.g. indicating that portable applications
> should not use '\n' but rather 'n\' ... or perhaps even generally place
> '\' last in the bracket expression?
>
> The best would of course be to get GNU to change its behaviour, though I
> have no idea how likely that is ;-)
>
> I had tried to reach out to GNU and BusyBox sed maintainers before, and
> while I got replies from BusyBox' I couldn't get in touch with GNU's.
>
> Is there anyone who's in contact with these people?
The GNU sed developers can be reached at bug-...@gnu.org (per the
output of 'sed --help', and as done in this email).

So if I'm restating your complaint correctly, you are worried that GNU
sed's non-POSIX behavior (what you get by default when POSIXLY_CORRECT
is not set) treats the four-byte sequence '[\n]' in an s-command regex
as a bracket expression for the single character of a literal newline
(that is, interpreting \n as an escape sequence even though it is
inside a bracket expression), instead of as a bracket expression for
either of a literal backslash or literal n; but you concur that its
behavior when being POSIX-compliant matches the POSIX rules.

POSIX can't control what GNU sed does when in non-POSIX mode. But it
can document a recommendation to spell the bracket expression intended
to match either a backslash or an n in the order [n\] to avoid any
potential confusion with [\n] being interpreted as an escape sequence.

Or am I missing something else that you are proposing, which either
the Austin Group should do in its documentation efforts, and/or which
GNU sed should do to comply with the recent Austin Group
recommendations?

--
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3266
Virtualization:  qemu.org | libvirt.org
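
For illustration only, here is a minimal sketch of the [n\] spelling discussed above. The helper name subst_portable is hypothetical and not part of any proposal. It assumes that a backslash placed last in a bracket expression is taken literally, as the POSIX rule for bracket expressions describes. Because no "\n" sequence appears in the regex, the result should not depend on whether POSIXLY_CORRECT is set.

```shell
#!/bin/sh
# Hypothetical helper (name invented for this sketch): replace "a",
# followed by either a literal backslash or a literal n, followed by
# "b", with X. The backslash is written last in the bracket
# expression, so the regex contains no "\n" sequence that an
# implementation could read as an escape for <newline>.
subst_portable() {
    sed 's/a[n\]b/X/'
}

# Input with a literal n between a and b.
printf 'anb\n' | subst_portable

# Input with a literal backslash between a and b.
printf 'a\\b\n' | subst_portable

# Input with a space between a and b; neither n nor backslash.
printf 'a b\n' | subst_portable
```

Under the assumption above, the first two commands should print X and the third should pass its input through unchanged.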