On Thu, 2022-05-19 at 09:05 +0100, Harald van Dijk wrote:
> > The above, AFAIU, mean that any shell/fnmatch matches a valid
> > multibyte character... but also a byte that is not a character in
> > the locale.
>
> Correct, though as I wrote later on, the way they go about it is
> different.

And I think, for any real standardisation of this (which I'd still love
to see), quite a few things would need to be reasonably defined,
including but most likely not limited to:

- Does * match bytes (by which I mean 1-n bytes which don't form valid
  characters in the current locale)?
- The same for ? ... and if that matches bytes at all: only 1, or n?
- In which "direction" is the matching done? AFAIU this would be
  important, e.g. for \303\244\244: *if* '?' were to match also bytes,
  is '?\244' meant to match the character followed by the byte:
    (\303\244)\244
  or could it fail to match, taking the byte followed by more bytes:
    (\303)\244\244
- I guess these questions would also pop up for the ##, #, %% and %
  forms of parameter expansion, especially when one has a locale like
  Big5. In the sense of: can one strip off a character (or byte) that
  forms part of another character?
  If shells were required not to decompose such valid characters (that
  contain another valid character, when looked at from right to left),
  then it would also need to be defined how the strings are to be
  interpreted (most likely of course: as defined by the respective
  character encoding).
  So for all these cases it might additionally be required to check how
  the different shells behave when trying to ##, #, %%, % ...
  And AFAIU, some actually allow to "decompose" a character.
  And even if the standard were to say that it must check whether the
  matched part is part of a bigger multibyte character (like in the
  Big5 case) and then not allow decomposing that... would it still be
  allowed to do so when the pattern contains bytes that are themselves
  not valid characters?
- Are there any undesired side effects? E.g. bash has the nocasematch
  shell option... which IIRC affects patterns... would we break
  anything in such fields?
- I think it already is defined (more or less) which locale is actually
  used for the matching, i.e. the current one as set by LC_CTYPE and
  not e.g.
  the "lexical" locale defined at the start of the shell.

> I tested this now. In that same list of shells, and in glibc
> fnmatch(), ? only matches a single invalid byte. Tested in an UTF-8
> locale with the string \200\200 and the patterns ? and ??. With ?,
> they do not match. With ??, they do.

The next question that would come to my mind:
Do these tests really give us a definite answer on the behaviour... or
may some things be dependent on the specific locale?
Maybe the above behaviour is *only* with UTF-8? Or can this be ruled
out?

> > So unlike before, in the above bash/fnmatch do seem to let '?'
> > match a single byte that is not a character... and the remaining
> > ones have quite mixed feelings
>
> Not quite: all of them always let ? match a single invalid byte, but
> here we have a single byte that is invalid on its own, valid as part
> of a character, and appears in the string as part of that character.
> When processing \303\244, most shells don't process this as the
> single byte \303 followed by the single byte \244, they preprocess
> this so that by the time they actually check whether it matches, they
> just see the character U+00C4, so that if a pattern looks for \303 on
> its own, it will not be found.

Hmm... seems a bit strange to me... I mean above you had:
string          pattern
\303.\303\244   ?.?

And e.g. bash didn't match... my assumption was, because the first \303
is not a character.

But later on you had:
\303\244        \303*
\303\244        \303?
which bash *did* match.
Sure, the 2 bytes together are already one character, but bash had to
match the single \303 plus * or ? ... and if above ? didn't match the
single invalid \303, it did match the single \244 here (which ain't a
character either).

Now even if one says it's the direction... there was also:
\303\244.\303   ?.?
with no match in bash... the first ? should be okay, because it's a
char... the 2nd one would be the lone \303 byte.

> > Seem also a bit strange to me,... all shells match \243 against ?
> > ...
> > i.e. ? matches a single byte that is not a character... but later
> > on it doesn't work again with \243] and ?]
>
> Remember that \243] is a single character β. \243] is not supposed to
> match when given a pattern ?]. The pattern ?] means "any character,
> followed by ]". "β" is a character not followed by ]. This is similar
> to how in UTF-8 environments, ä should not match against the pattern
> ?? even though both of the bytes that make up ä individually do match
> against the pattern ?.

Okay but isn't that then the case where the matching direction actually
matters?

You had:
\243            ?
=> all matched, with \243 alone being invalid

\243]           ?]
=> all didn't match, \243] being a char together, and ] alone being
   one, too

So in the 2nd case, the ? would match the "whole" character, and there
would be no (further) ] left to match.

That's what I meant above...
a) It would need to be properly defined (and shared by all
   implementations) how this is matched, i.e. first looking for
   characters, then for bytes?
b) What is the expected behaviour when stripping off the ] ?
   Does it strip off the char ] and leave the byte \243 ... or does it
   not even match the char ] because that is considered part of the
   multibyte char?

When ? (and analogously *) match single invalid bytes... and presumably
also strip off such... all would become quite difficult to handle (for
users) when in some situations they wouldn't, just depending on
character encodings.

> I think there is still value in standardising this, but there is a
> bit more variation than I expected. For the non-bosh implementations,
> I can think of a general idea of wording that would cover everything
> except for the bits I wrote I felt should be regarded as
> implementation bugs, that shouldn't be too difficult.

Would be interested to see that... But I guess this would also require
any such implementations to "fix" what you consider bugs.


Cheers,
Chris.
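
PS: In case anyone wants to re-run such string/pattern checks in their
own shell and locale, the following is a minimal sketch only. It is not
what was used for the tests quoted above; the helper name `probe` and
the locale name C.UTF-8 are assumptions made for illustration, and the
locale may not exist on every system. String and pattern are given as
printf-style octal escapes and are used as the printf format, so
neither may contain a % character.

```shell
#!/bin/sh
# probe LOCALE STRING PATTERN   (hypothetical helper, illustration only)
# STRING and PATTERN are printf-style octal escapes, e.g. '\303\244'.
# The body is a subshell, so the locale change does not leak out.
probe() (
    LC_ALL=$1
    export LC_ALL
    s=$(printf "$2")    # build the raw bytes of the string
    p=$(printf "$3")    # build the raw bytes of the pattern
    case $s in
        $p) echo 'match' ;;     # unquoted: $p is used as a pattern
        *)  echo 'no match' ;;
    esac
)

loc=C.UTF-8     # assumption: adjust to a UTF-8 locale that is installed

# The string/pattern pairs discussed in this thread:
probe "$loc" '\200\200'       '?'
probe "$loc" '\200\200'       '??'
probe "$loc" '\303.\303\244'  '?.?'
probe "$loc" '\303\244.\303'  '?.?'
probe "$loc" '\303\244'       '\303*'
probe "$loc" '\303\244'       '\303?'
```

The output will differ per shell and possibly per locale, which is more
or less the point of the questions above. Running it once per shell
(and with a Big5 locale for the \243] cases, where one is installed)
would at least show where implementations disagree.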