On 27/06/2019 07:15, Stephane Chazelas wrote:
2019-06-26 23:56:06 +0100, Harald van Dijk:
[...]
You are proposing a fundamental change to the design of pattern matching,
not a clarification as you previously called it, and you are now discussing
how to allow the behaviour of one specific shell that does not behave the
way you like, but not the other shells that also do not behave the way you
like, when those other shells were not only changed intentionally to get
more consistent behaviour, at least in my case as the result of a user
request, but also because that more consistent behaviour is required by the
current version of POSIX, solely because of theoretical problems with file
names specifically crafted to break scripts, file names that are not
actually used in the wild.
[...]

I'm not a shell implementer. I'm on the side of the application
writer, I want to be able to write portable shell scripts, and
POSIX (*Portable* Operating System *Interface*) is meant to work
for me. It's meant to tell me what I can and cannot write in my
script and the behaviour to expect. It's meant to help you the
implementer write your shell so that it can interpret my
portable script the way it's meant to.

Oh, I agree that there is a bug. Given that most shells do not behave the way POSIX specifies, POSIX should not be requiring that behaviour. However, if you wait until after some shells have already implemented what is specified, it's then too late to just change the rules to forbid it. Your logic works both ways: now those shells have to be taken into account. It is not reasonable for POSIX to call uses portable that in fact are not, or no longer are.

But in fact although the wording you talked about so far did not include it, you did raise that point already in your 26/06/2019 14:39 +01:00 message:

So the only characters that need to be quoted (or put inside [...]
when the pattern is in the result of some word expansion --
remember that you need to move the backslash processing out of
the shell pattern matching as it's a fnmatch()/glob() thing only)
are ?, [, * and also \ to accommodate shells that have implemented
some form or another of special processing of \ independently of
quoting and (, and ) to accommodate ksh93 (in pattern matching
only, those are not a problem in pathname expansion).

Sorry for missing it the first time.

Today, by your reading of the spec (and I agree it can be seen as
a valid reading), the spec is telling me that:

1.

a='\.'
printf '%s\n' $a

is a portable script that is meant to output "."

2.

a='\**'
printf '%s\n' $a

is a portable script that is meant to list the filenames that
start with "*" in the current directory

3.

pattern='*;*'
case $var in ($pattern) echo yes; esac

is a non-standard, non-portable script with unspecified
behaviour because shell implementations are free to use that ";"
as an extended glob operator.

4.

string='@(foo)'
echo $string

is a non-standard, non-portable script which is not guaranteed
to output @(foo).

5.

string='@(foo)'
case $string in $string) echo yes; esac

is a non-standard, non-portable script which is not guaranteed
to output "yes".

6.

pattern='@(*)'
case "@(foo)" in $pattern) echo yes; esac

is a non-standard, non-portable script which is not guaranteed
to output "yes".

Agreed: for all of these, that is what I believe POSIX currently specifies.

1 and 2 are the reason I raised bug 1234. 1 couldn't be further
from the truth. Only bash5 exhibits that behaviour and it's
evident it's a bad idea.

If you accept that unquoted backslash behaves that way in some shells, even before bash 5, then changing the shell to always treat unquoted backslash the same way makes the shell behaviour easier to understand. I consider it an improvement over backslash's meaning changing in ways that were hard to predict.

It's evident that it was not the
intention of the spec as no shell at the time it was written did
it. Even if POSIX made it very explicit that 1 is required to
behave as described above, I could probably not call it a
portable script in a million years, as I'd expect shell
implementations would rather keep their backward
compatibility than implement that unreasonable requirement
(which IMO doesn't help at all with consistency). So the spec is
wrong and needs to be fixed.

Yes, to document current practice, the spec should effectively say in some way that whether, and if so to what extent, backslash can act as an escape character (in addition to a quote character) in shells is unspecified.

2 is slightly more portable, but even in those shells where it
does that, that's not because they implement \ processing the
way POSIX seems to specify it, and all do it a different way.
I'm not opposed to POSIX *allowing* a \ in an unquoted word
expansion to have a special meaning when it's preceding *, ? and
[ as that's what several implementations do and it's not
breaking that many common shell usages.

It should not be limited to when it's preceding any specific character, though. That is something no shell has done. Shells currently vary in whether backslash can function as an escape character during pattern matching, but when it can, it does not depend on which character follows it.
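As a rough way to see which side of that variation a given shell falls on, a minimal probe could look like the sketch below. The function name backslash_in_expansion is made up for this example, and the result depends on the shell (and version) that runs it, so no particular output is claimed here.

```shell
# backslash_in_expansion (hypothetical helper): reports how this shell
# treats a backslash that comes from an unquoted expansion used as a
# case pattern.
backslash_in_expansion() {
    pattern='\.'
    case . in
        # Reached if the backslash escaped the "." during matching.
        ($pattern) echo escape ;;
        # Reached if the backslash was taken as an ordinary character,
        # so the pattern only matches the two-character string \.
        (*) echo literal ;;
    esac
}

backslash_in_expansion
```

Replacing the "." in both places with another character gives a quick check of whether the result depends on the character that follows the backslash.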

3 is portable in practice. And I should be able to rely on it.
I'd rather POSIX doesn't open the door for a shell (or
fnmatch()...) to choose ; to be a new glob operator, I would
rather the sh glob operators stay ?, [] and * (and \ now added
because of those shells that treat it specially), so I know
which to escape (with quoting (or \ in fnmatch()) or [...] when
in word expansions) or to look out for. Several shells have some
of those operators but they are not enabled in posix/sh mode so
they interpret sh scripts like sh is meant to.

Agreed that it makes sense to disallow most characters from being treated specially during pattern matching, when most characters are already not treated specially in existing shells.
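For the operators that all shells agree on, the two escaping methods mentioned above (quoting, or a bracket expression) can be sketched as follows. The helper names matches and matches_literal are invented for this illustration only.

```shell
# matches STRING PATTERN (hypothetical helper): PATTERN is used as an
# unquoted expansion, so *, ? and [ act as pattern operators.
matches() {
    case $1 in
        ($2) echo yes ;;
        (*) echo no ;;
    esac
}

# matches_literal STRING PATTERN (hypothetical helper): PATTERN is
# quoted, so every character in it is compared literally.
matches_literal() {
    case $1 in
        ("$2") echo yes ;;
        (*) echo no ;;
    esac
}

matches 'aXb' 'a*b'           # yes: * is a wildcard
matches_literal 'aXb' 'a*b'   # no: quoting makes * literal
matches 'a*b' 'a[*]b'         # yes: [*] matches a literal *
matches 'aXb' 'a[*]b'         # no: [*] matches nothing but *
```

The bracket form is the one that remains usable when the pattern has to travel inside a variable and be expanded unquoted.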

4 is portable in practice. 5 as well but only because of the
buggy fallback string comparison in ksh93.

Sure, but worth adding is that ksh's documented behaviour is different from its implementation. According to its documentation, 4 isn't supposed to work the way it does in other shells.

6 is the only one that is true. Yes, there is *one* shell (a
shell generally considered "experimental" and not in wide use)
where that won't work as expected (won't output yes) as that's
one case where ksh93's extended glob operator is conflicting
with sh compatibility. It's not consistent with 4 there. Geoff's
proposing to fix that inconsistency to allow that operator to be
used for pathname expansion, but I believe it would be more
reasonable to fix it by not allowing it for "case" (make 6 a
portable script again) to make the standard consistent and
clear. Then ksh93 could enable those extended operators wherever
it likes when called as ksh, but not when called as sh (at least
not in the result of word expansions; basically reverting to
ksh88 behaviour).

If the ksh maintainers are willing to implement that and document exactly when these extensions do and don't work, that makes sense to me.

I could be convinced that it makes sense for the ksh93 X(...)
operators to be allowed if there was one non-anecdotal
implementation of fnmatch() that implemented it, but I don't
think there is. find implementations usually have a -regex
predicate to do things that basic globs can't do instead.

I also like the idea of opening up a way for shell wildcards to
be extended in the future, but it's a dangerous business. Today
in practice, scripts doing things like "find . -name
'*([0-9]).mp3'" to match on "foo(2).mp3" are likely to exist and
would be broken if "find" (fnmatch()) started to implement that
*(...) ksh operator (worse for the # or (..|..) or ~ or ^
extended zsh operators (which are not enabled in sh mode)).

Despite that pax example in the POSIX rationale that you quoted
earlier, I don't expect many people are aware that POSIX
currently leaves the behaviour unspecified and requires them to
write

find . -name '*\([0-9]\).mp3' or
find . -name '*\ *.mp3' and even
find . -name '*[\ ]*.mp3'

In practice, they don't need to escape those " ", "(" even
though they would need to quote them when used in a shell glob.

If there is no fnmatch() implementation that behaves that way, then agreed that it makes sense to just specify that. That pax example in the rationale should then also be changed to not escape any parenthesis.

Where did this pax example come from, though? Was that based on a real pax implementation that did have special treatment of parentheses, not just an invention?

dash has a build-time option to use fnmatch() internally to do its pattern matching. If there are fnmatch() implementations out there that treat characters as special, this will be visible in dash when it's built in that mode too.
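For anyone who wants to check a particular system, a small probe along these lines might help. It is only a sketch: the function name probe_find_parens is made up, it assumes mktemp -d is available (which POSIX does not guarantee), and it only tells you about the find it runs, not necessarily about that system's fnmatch().

```shell
# probe_find_parens (hypothetical helper): prints "literal" if this
# find treats ( and ) in a -name pattern as ordinary characters, and
# "special" if the pattern fails to match a file named foo(2).mp3.
probe_find_parens() {
    # Assumption: mktemp -d exists on this system.
    dir=$(mktemp -d) || return 1
    : > "$dir/foo(2).mp3"
    found=$(find "$dir" -name '*([0-9]).mp3')
    rm -rf "$dir"
    if [ -n "$found" ]; then
        echo literal
    else
        echo special
    fi
}

probe_find_parens
```

With a ksh-style reading of *(...), the pattern would mean "zero or more digits followed by .mp3", which does not match foo(2).mp3, so the two interpretations give different answers.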

Cheers,
Harald van Dijk
