Re: [Issue 8 drafts 0001564]: clariy on what (character/byte) strings pattern matching notation should work

Christoph Anton Mitterer via austin-group-l at The Open Group Thu, 19 May 2022 17:12:37 -0700

On Thu, 2022-05-19 at 09:05 +0100, Harald van Dijk wrote:
> > 
> > The above, AFAIU, mean that any shell/fnmatch matches a valid
> > multibyte
> > character... but also a byte that is not a character in the locale.
> 
> Correct, though as I wrote later on, the way they go about it is
> different.


And I think, for any real standardisation of this (which I'd still love
to see) quite a few things would need to be reasonably defined,
including but most likely not limited to:
- Does * match bytes (by which I mean 1-n bytes which don't form valid
  characters in the current locale)?

- The same for ? ... and if that matches bytes at all - only 1 or n?

- In which "direction" is the matching done? AFAIU this would be
  important, e.g. for
  \303\244\244
  *if* '?' were to match also bytes, is '?\244' meant to match
  the character followed by the byte:
    (\303\244)\244
  or could it fail to match, because the string is taken as the byte
  followed by more bytes:
    (\303)\244\244

- And I guess these questions would also pop up for the ##, #, %% and %
  forms of parameter expansion, especially when one has a locale like
  Big5.
  In the sense of: can one strip off a character (or byte) that forms
  part of another character?
  If shells were required not to decompose such valid characters (that
  contain another valid character, when looked at from right to
  left), then it would also need to be defined how the strings need
  to be interpreted (most likely of course: as defined by the
  respective char encoding).

  So for all these cases it might additionally be required to check how
  the different shells behave when trying to ##, #, %%, % ...
  And AFAIU, some actually allow one to "decompose" a character.

  And even if the standard were to say that it must check whether the
  matched part is part of a bigger multibyte character (like in the
  Big5 case) and then not allow decomposing that... would it still
  be allowed to do so when the pattern contains bytes that are
  themselves not valid characters?

- Are there any undesired side effects? E.g. bash has the nocasematch
  shell option... which IIRC affects patterns... would we break
  anything in such areas?

- I think it already is defined (more or less) which locale is actually
  used for the matching, i.e. the current one as set by LC_CTYPE and
  not e.g. the "lexical" locale in effect at the start of the shell.
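
To make the "direction" point above a bit more concrete, here is a
minimal sketch in Python (not shell). The helper names are made up for
illustration, and it assumes UTF-8 and that '?' may match either one
valid character or one single raw byte. It simply enumerates the ways
\303\244\244 could be split into such units and checks the pattern
?\244 against each split:

```python
def is_char(chunk: bytes) -> bool:
    """True if chunk is exactly one valid UTF-8 character."""
    try:
        return len(chunk.decode("utf-8")) == 1
    except UnicodeDecodeError:
        return False


def segmentations(data: bytes):
    """Yield every split of data into units that are either one valid
    UTF-8 character or one single raw byte (hypothetical model)."""
    if not data:
        yield []
        return
    for n in range(1, min(4, len(data)) + 1):
        head = data[:n]
        # a single byte is always allowed as a unit, longer heads
        # only if they form one valid character
        if n == 1 or is_char(head):
            for rest in segmentations(data[n:]):
                yield [head] + rest


def matches(units, pattern) -> bool:
    """pattern is a list: '?' matches any one unit, a bytes item
    matches only the identical unit."""
    return len(units) == len(pattern) and all(
        p == "?" or p == u for u, p in zip(units, pattern))


string = b"\303\244\244"
pattern = ["?", b"\244"]
for units in segmentations(string):
    print(units, matches(units, pattern))
```

With both splits being possible in principle, the result depends on
which one an implementation picks, which is exactly what would need to
be specified.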



> I tested this now. In that same list of shells, and in glibc
> fnmatch(), 
> ? only matches a single invalid byte. Tested in an UTF-8 locale with
> the 
> string \200\200 and the patterns ? and ??. With ?, they do not match.
> With ??, they do.

The next question that would come to my mind:
Do these tests really give us a definite answer on the behaviour... or
may some things be dependent on the specific locale? Maybe the above
behaviour is *only* with UTF-8?
Or can this be ruled out?
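
One reason I wouldn't rule it out right away: whether a given byte
sequence is "invalid" at all already depends on the encoding. A quick
sketch (Python, using its own codecs rather than any libc locale, so
take it only as an illustration; is_valid is a made-up helper):

```python
def is_valid(data: bytes, encoding: str) -> bool:
    """True if data decodes completely in the given encoding."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


# the two test strings from this thread, checked in three encodings
for enc in ("utf-8", "iso-8859-1", "big5"):
    print(enc, is_valid(b"\200\200", enc), is_valid(b"\243]", enc))
```

So a test string that exercises the "invalid byte" paths in a UTF-8
locale may consist of perfectly valid characters in another one, and
the tests would probably need to be repeated per encoding.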

> 


> > So unlike before, in the above bash/fnmatch do seem to let '?'
> > match a
> > single byte that is not a character... and the remaining ones have
> > quite mixed feelings
> Not quite: all of them always let ? match a single invalid byte, but 
> here we have a single byte that is invalid on its own, valid as part
> of 
> a character, and appears in the string as part of that character.
> When 
> processing \303\244, most shells don't process this as the single
> byte 
> \303 followed by the single byte \244, they preprocess this so that
> by 
> the time they actually check whether it matches, they just see the 
> character U+00E4, so that if a pattern looks for \303 on its own, it 
> will not be found.

Hmm... seems a bit strange to me... I mean above you had:

string                  pattern
\303.\303\244           ?.?

And e.g. bash didn't match... my assumption was that this is because
the first \303 is not a character.

But later on you had:
\303\244                \303*
\303\244                \303?
which bash *did* match.

Sure, the 2 bytes together are already one character, but bash had to
match the single \303 plus * or ? ... and while above ? didn't match
the single invalid \303, here it did match the single \244 (which ain't
a character either).

Now even if one says it's the direction... there was also:
\303\244.\303           ?.?
with no match in bash... the first ? should be okay, because it's a
char... the 2nd one would be the lone \303 byte.
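
For comparison, this is roughly the model I would naively have
expected: decode first, treat every byte that doesn't decode as a unit
of its own, then match. A minimal sketch in Python, where its fnmatch
module plus the surrogateescape error handler stand in for such a
model. That is purely an assumption for illustration, not how bash or
glibc work, and model_match is a made-up name:

```python
import fnmatch


def model_match(string: bytes, pattern: bytes) -> bool:
    """Decode-first model: each undecodable byte becomes one unit
    (a lone surrogate), then ordinary character matching is done."""
    s = string.decode("utf-8", "surrogateescape")
    p = pattern.decode("utf-8", "surrogateescape")
    return fnmatch.fnmatchcase(s, p)


cases = [
    (b"\200\200", b"?"),
    (b"\200\200", b"??"),
    (b"\303.\303\244", b"?.?"),
    (b"\303\244.\303", b"?.?"),
    (b"\303\244", b"\303*"),
    (b"\303\244", b"\303?"),
]
for string, pattern in cases:
    print(string, pattern, model_match(string, pattern))
```

In this model both ?.? cases match and neither \303* nor \303? does,
i.e. the opposite of the bash results quoted above, which is why I
find those hard to explain with one simple rule.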

> 

> > Seem also a bit strange to me,... all shells match \243 against ?
> > ...
> > i.e. ? matches a single byte that is not a character... but later
> > on it
> > doesn't work again with \243] and ?]
> 
> Remember that \243] is a single character β. \243] is not supposed to
> match when given a pattern ?]. The pattern ?] means "any character, 
> followed by ]". "β" is a character not followed by ]. This is similar
> to 
> how in UTF-8 environments, ä should not match against the pattern ?? 
> even though both of the bytes that make up ä individually do match 
> against the pattern ?.

Okay but isn't that then the case where the matching direction actually
matters?

You had:
\243    ?       => all matched, with \243 alone being invalid
\243]   ?]      => none matched, \243] being a char together,
                                 and ] alone, too
So in the 2nd case, the ? would match the "whole" character, and there
would be no (further) ] left to match.

That's what I meant above,...
a) it would need to be properly defined (and shared by all
   implementations) how this is matched,... i.e. first looking for
   characters, then for bytes?

b) What is the expected behaviour when stripping off the ] ?
   Does it strip off the char ] and leave the byte \243 ... or does it
   not even match the char ], because that is considered part of the
   multibyte char?

   When ? (and analogously *) match single invalid bytes... and
   presumably also strip off such... all would become quite difficult to
   handle (for users) if in some situations they wouldn't - just
   depending on character encodings.
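
To illustrate b) with the Big5 example: stripping a trailing ] gives
different results depending on whether one looks at bytes or at
characters. Again this is just a Python sketch with made-up helper
names, using Python's big5 codec as a stand-in for a Big5 locale; it
says nothing about what any particular shell does:

```python
def strip_suffix_bytes(data: bytes, suffix: bytes) -> bytes:
    """Byte-wise view: ignores character boundaries entirely."""
    if suffix and data.endswith(suffix):
        return data[:-len(suffix)]
    return data


def strip_suffix_chars(data: bytes, suffix: str, encoding: str) -> bytes:
    """Character-wise view: decode first, strip, encode again."""
    text = data.decode(encoding)
    if suffix and text.endswith(suffix):
        text = text[:-len(suffix)]
    return text.encode(encoding)


value = b"\243]"  # a single character in Big5
print(strip_suffix_bytes(value, b"]"))         # the lone byte \243 is left
print(strip_suffix_chars(value, "]", "big5"))  # nothing is stripped
```

The byte-wise variant leaves behind something that is no longer a
valid string in that encoding, the character-wise one never sees a ]
at all.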
  

> I think there is still value in standardising this, but there is a
> bit 
> more variation than I expected. For the non-bosh implementations, I
> can 
> think of a general idea of wording that would cover everything except
> for the bits I wrote I felt should be regarded as implementation
> bugs, 
> that shouldn't be too difficult.

Would be interested to see that...

But I guess this would also require any such implementations to "fix"
what you consider bugs.


Cheers,
Chris.
