Re: More issues with pattern matching
On 05/08/2020 15:54, Geoff Clare via austin-group-l at The Open Group wrote:
> Harald van Dijk wrote, on 31 Jul 2020:
>> Take the previous example glibc's cy_GB.UTF-8 locale, but with a different
>> collating element: in this locale, "dd" is a single collating element too.
>> Therefore, this must be matchable by bracket expressions.
>
> Incorrect. I think you overlooked these statements in XBD 9.3.5 items 2 and 3:
>
>     It is unspecified whether a matching list expression matches a
>     multi-character collating element that is matched by one of the
>     expressions.
>
>     It is unspecified whether a non-matching list expression matches a
>     multi-character collating element that is not matched by any of the
>     expressions.

My message was indirectly in reply to your message where you claimed that shells were required to support this. I'm very happy to see that this is actually not true, thanks for that.

Cheers,
Harald van Dijk
Re: More issues with pattern matching
Harald van Dijk wrote, on 31 Jul 2020:
>
> Take the previous example glibc's cy_GB.UTF-8 locale, but with a different
> collating element: in this locale, "dd" is a single collating element too.
> Therefore, this must be matchable by bracket expressions.

Incorrect. I think you overlooked these statements in XBD 9.3.5 items 2 and 3:

    It is unspecified whether a matching list expression matches a
    multi-character collating element that is matched by one of the
    expressions.

    It is unspecified whether a non-matching list expression matches a
    multi-character collating element that is not matched by any of the
    expressions.

-- 
Geoff Clare
The Open Group, Apex Plaza, Forbury Road, Reading, RG1 1AX, England
Re: More issues with pattern matching
On 31/07/2020 00:10, Harald van Dijk wrote:
> Take the previous example glibc's cy_GB.UTF-8 locale, but with a different collating element: in this locale, "dd" is a single collating element too. Therefore, this must be matchable by bracket expressions. However, "d" individually must *also* be matched by pattern expressions. "dd" can be matched by both [!x] and [!x][!x]. A shell cannot use regcomp()+regexec() to find the longest match for [!x] and assume that that is matched: a shell where
>
>   case dd in [!x]d) echo match ;; esac
>
> does not print "match" does not implement what POSIX requires. A shell where
>
>   case dd in [!x]) echo match ;; esac
>
> does not print "match" does not implement what POSIX requires either. Using regcomp()+regexec() to bind [!x] to either "d" or "dd" without taking the rest of the pattern into account will fail to match in one of these cases. And it needn't be the same way for all bracket expressions in a single pattern:
>
>   case ddd in [!x][!x]) echo match ;; esac
>
> Shells are required by POSIX to consider both the possibility that [!x] picks up "d" and that it picks up "dd" for each bracket expression individually.

A followup example: it seems downright crazy that POSIX would require that

  case ddd in *[!d]*) echo match ;; esac

prints "match", yet that appears to be exactly what it does require, and exactly what yash implements: "dd" is a collating element which is not "d", and therefore must be matched by [!d].

And this is something where GNU fail to implement the POSIX-specified behaviour even in regular expressions. If the regular expression support does not work as specified, shells cannot implement pattern matching on top of regular expressions and expect correct results.

  $ echo ddd | LC_ALL=cy_GB.UTF-8 grep '[^d]'
  $ echo ddd | LC_ALL=cy_GB.UTF-8 grep '.[^d]'
  ddd

It's clear that if the second prints 'ddd', so should the first, so it's clear that this result indicates a bug. What's not clear to me is whether the second should print 'ddd'.

When the string 'ddd' is part of a set of strings to be sorted, the first collating element is 'dd' and the second is 'd'. The second and third character do not together form a collating element. Is it correct that grep nevertheless uses that second and third 'd' to match '[^d]'? If that is not correct, then shells cannot use regexec() at a given starting position: that given starting position may yield different collating elements compared to when the string is searched from the beginning.

Cheers,
Harald van Dijk
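For readers who want to see the technique being questioned, here is a minimal C sketch of "use regexec() at a given starting position". It is an illustration only: bracket_match_len_at() is a hypothetical helper name, the bracket is assumed to be already in RE syntax ([^x] rather than the shell's [!x]), and what it reports depends entirely on the C library and the current locale. It looks at the string only from the offset onwards, which is exactly the property the message above is unsure about.

```c
#include <regex.h>
#include <stdio.h>
#include <sys/types.h>

/* Hypothetical helper: return how many bytes of text, starting at byte
 * offset, are matched by the RE bracket expression re_bracket, or -1 if it
 * does not match there (or cannot be compiled). The caller must make sure
 * offset does not exceed strlen(text). */
long bracket_match_len_at(const char *re_bracket, const char *text,
                          size_t offset)
{
    char pattern[256];
    regex_t re;
    regmatch_t m;
    long len = -1;

    /* Anchor the bracket so the match has to begin exactly at text + offset. */
    int n = snprintf(pattern, sizeof pattern, "^%s", re_bracket);
    if (n < 0 || (size_t)n >= sizeof pattern)
        return -1;
    if (regcomp(&re, pattern, 0) != 0)      /* 0 selects basic REs */
        return -1;
    if (regexec(&re, text + offset, 1, &m, 0) == 0)
        len = (long)m.rm_eo;                /* end of match, in bytes */
    regfree(&re);
    return len;
}
```

In the "C" locale, where there are no multi-character collating elements, bracket_match_len_at("[^x]", "ddd", 1) reports a one-byte match. What it reports under cy_GB.UTF-8 is the open question.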
Re: More issues with pattern matching
On 26/09/2019 10:20, Geoff Clare wrote: Geoff Clare wrote, on 26 Sep 2019: Are shells required to support this, and are shells therefore implicitly required to translate patterns to regular expressions, or should it be okay to implement this with single character support only? Shells are required to support it. They don't need to translate entire patterns to regular expressions - they can use either regcomp()+regexec() or fnmatch() to see if the bracket expression matches the next character. Sorry, I should have written "matches *at* the next character" here; I didn't mean to imply checking against a single character. For example, if using regcomp()+regexec() the shell could try to match the bracket expression against the remainder of the string and see how much of it regexec() reported as matching. To use fnmatch() I suppose you would have to use it in a loop, passing it first one character, then two, etc. (stopping at the number of characters between the '.'s). As I had replied at the time, it is fundamentally impossible in the general case as POSIX does not provide any mechanism to escape characters and there is nothing in POSIX that rules out the possibility of a collating element containing "=]" or ".]". However, ignoring that aspect of it, looking at implementing this once again, implementing it the way you specified is incorrect, fixing it to make it correct cannot possibly be done efficiently with standard library support, and shells in general don't bother to implement what POSIX specifies here. Take the previous example glibc's cy_GB.UTF-8 locale, but with a different collating element: in this locale, "dd" is a single collating element too. Therefore, this must be matchable by bracket expressions. However, "d" individually must *also* be matched by pattern expressions. "dd" can be matched by both [!x] and [!x][!x]. 
A shell cannot use regcomp()+regexec() to find the longest match for [!x] and assume that that is matched: a shell where case dd in [!x]d) echo match ;; esac does not print "match" does not implement what POSIX requires. A shell where case dd in [!x]) echo match ;; esac does not print "match" does not implement what POSIX requires either. Using regcomp()+regexec() to bind [!x] to either "d" or "dd" without taking the rest of the pattern into account will fail to match in one of these cases. And it needn't be the same way for all bracket expressions in a single pattern: case ddd in [!x][!x]) echo match ;; esac Shells are required by POSIX to consider both the possibility that [!x] picks up "d" and that it picks up "dd" for each bracket expression individually. This means that in the worst case, if every bracket expression in a pattern has X ways to match, and a pattern has Y bracket expressions, the shell is required to consider X^Y possibilities. This is completely unreasonable and it's obvious why no shell actually does this. The complexity can be reduced in theory, but POSIX does not expose enough information to allow that to be implemented in a shell. The only way around this mess is by translating the whole pattern to a regular expression, as only the C library has enough detailed knowledge about the locale that it can implement it efficiently.[*] Doing that has its own new set of problems though: translating the whole pattern to a regular expression means the shell no longer has the option to decide how to handle invalid byte sequences (byte sequences that lead to EILSEQ) that shells in general try to tolerate, and the shell no longer has the option to decide how to handle invalid patterns (patterns containing non-existent character classes or collating elements) which shells in general also aim to tolerate. Cheers, Harald van Dijk [*] I have not investigated whether implementations actually do do this efficiently.
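The "consider both possibilities for each bracket expression" requirement described above can be illustrated with a toy backtracking matcher in C. Everything here is an assumption made for the sketch: toy_match() and toy_multi[] are hypothetical names, the table of multi-character collating elements is hard-coded rather than taken from a locale (POSIX offers no portable way to list them), and the pattern language is reduced to literal characters and brackets of the form [!c]. It models the reading argued for in this message, not any real shell.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical stand-in for locale data: the multi-character collating
 * elements of a toy locale. */
static const char *const toy_multi[] = { "dd" };

/* Match text against a toy pattern made only of literal characters and
 * non-matching brackets of the form [!c]. Each bracket tries every collating
 * element that starts at the current position and backtracks on failure. */
bool toy_match(const char *pat, const char *text)
{
    if (*pat == '\0')
        return *text == '\0';

    if (pat[0] == '[' && pat[1] == '!' && pat[2] != '\0' && pat[3] == ']') {
        char excluded = pat[2];
        const char *rest = pat + 4;

        /* Choice 1: a multi-character collating element. It can never be
         * equal to the single excluded character, so [!c] accepts it. */
        for (size_t i = 0; i < sizeof toy_multi / sizeof toy_multi[0]; i++) {
            size_t len = strlen(toy_multi[i]);
            if (strncmp(text, toy_multi[i], len) == 0 &&
                toy_match(rest, text + len))
                return true;
        }
        /* Choice 2: a single character other than the excluded one. */
        return *text != '\0' && *text != excluded &&
               toy_match(rest, text + 1);
    }

    /* Anything else is a literal character. */
    return *text != '\0' && *text == *pat && toy_match(pat + 1, text + 1);
}
```

With this model, toy_match("[!x]d", "dd"), toy_match("[!x]", "dd") and toy_match("[!x][!x]", "ddd") all succeed, each through a different binding of the brackets. Because every bracket opens a new choice point, a pattern with Y brackets and X choices per bracket can explore on the order of X^Y combinations, which is the cost complained about above.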
Re: More issues with pattern matching
"Schwarz, Konrad" wrote:

> > -Original Message-
> > From: Robert Elz
>
> > So, is [[:"alpha":]] required to be treated the same as [[:alpha:]] , not
> > allowed to be treated the same,
> > explicitly unspecified, or simply never considered (previously) ?
>
> An argument for requiring [[:"alpha":]] to be the same as [[:alpha:]] is that
> it would allow character-class names
> with white space, e.g., "title case".

There is no need to do this since my implementations for [[:alpha:]] first check for the presence of ":]" and then use the text between [[: and :]] as character class name. I expect other implementations to do the same.

Jörg

-- 
EMail:jo...@schily.net(home) Jörg Schilling D-13353 Berlin
joerg.schill...@fokus.fraunhofer.de (work) Blog: http://schily.blogspot.com/
URL: http://cdrecord.org/private/ http://sf.net/projects/schilytools/files/
Re: More issues with pattern matching
Robert Elz wrote, on 27 Sep 2019: > > | In the case of [x[:bogus:]], the use of both colons clearly indicates > | the intention to use the new character-class feature. If the name > | between the colons is not a valid class name, that is likely due to > | an error on the user or application writer's part when typing the name. > > I had been waiting for that argument, it is the only one that is > half way rational, and supports that position. But half way is as > far as it gets. > > POSIX allows locales to define new char classes, it says so, XBD 7.3, > page 141, lines 4218-4226. > > Since a locale is allowed to define a new char class name, the shell > (or regcomp() for the RE case) cannot know whether the user here: > > | For example, if a user types: > | > | grep '[[:alhpa:]]' file > > made a typo for alpha (the standard posix defined char class), or really > intended alhpa a locale specific char class in some locale which is not > the current one. > > Making this some kind of error, in either REs, or shell patterns (whatever > the effect of that is) makes it impossible for users to ever safely, and > simply, use the locale specific locale name. > > They cannot even test which locale is in use as (aside from it being > impossible > to be sure which locales have added this new char class to their definitions) > there's no guarantee that even if we know that LC_CTYPE=EN_dislexic > contains the alhpa character class, in some implementations, there is no > sane way to know whether the current impoementation does. > > That is, unless you're requiring that before a locale specific char class > can be used, the user (on the command line) or script, is required to > query the locale and test whether the char class is defined there or not. > > Requiring that would be absurd. You might consider it absurd, but it is what the standard requires applications and users to do in order to avoid "undefined results" (as per XBD 9.1 under "invalid"). 
The standard even acknowledges that applications need to be able to do that, in the APPLICATION USAGE for the locale utility:

    Implementations are not required to write out the actual values for keywords in the categories LC_CTYPE and LC_COLLATE ; however, they must write out the categories (allowing an application to determine, for example, which character classes are available).

In a C program, finding out if a character class name is valid for the current locale is simply a matter of calling wctype(name) and checking whether it returns (wctype_t)0. So applications using fnmatch(), glob() or regcomp() can do that before using a name that isn't one of the mandated ones.

> | > So, is [[:"alpha":]] required to be treated the same as [[:alpha:]] ,
> | > not allowed to be treated the same, explicitly unspecified, or simply
> | > never considered (previously) ?
> |
> | I believe the intention is that it be treated the same as [[:alpha:]].
>
> Good, that is what I would have hoped. Now maybe we should add something
> to make that explicit.

Yes, I think an addition is warranted. Maybe we should add a new paragraph to 2.13 (before the 2.13.1 heading) along the lines of:

    In the shell, any quoting characters (see [xref to 2.2]) that are present in a word to be used as a pattern, and are treated as special, shall participate in pattern matching only through their effects on other characters; they shall not themselves be treated as pattern characters. For example:

        ls -ld \\*

    lists files with names that begin with a single <backslash>,

        ls -ld "?"*

    lists files with names that begin with a <question-mark>,

        ls -ld [[:'alpha':]]*

    lists files with names that begin with an alphabetic character in the current locale, and

        ls -ld [[':alpha:']]*

    lists files with names that begin with a character from the set { '[', ':', 'a', 'l', 'p', 'h' } followed by a ']'.

> > | The word "may" has a strict usage.
See XBD 1.5 - it "Describes a > | feature or behavior that is optional for an implementation that > | conforms to POSIX.1-2017." > | > | However, there have been cases in the past where incorrect uses "may" > | have been found and changed to "can". > | > | In any case, the "shall" in XCU 2.13.1 overrides it. > > Only for shell patterns, we still need to decide whether it was the > defined "may" or an erroneous use which should be replaced by "can" > for regular expressions. Given the shell imperative, and the desire > to make bracket expressions in sh patterns and REs as equivalent as > possible, I suspect the latter. The rationale in XRAT says the opposite (A.9.1): The ISO POSIX-2:1993 standard required bracket expressions like "[^[:lower:]]" to match multi-character collating elements such as "ij". However, this requirement led to behavior that many users did not expect and that could not feasibly be mimicked in user code, and it was rarely if ever
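The wctype() check mentioned earlier in this message (call wctype(name) and see whether it returns (wctype_t)0) is short enough to show in full. This is a minimal sketch; class_is_known() is a hypothetical wrapper name, and the answer reflects whatever LC_CTYPE locale is in effect at the time of the call.

```c
#include <stdbool.h>
#include <wctype.h>

/* Hypothetical wrapper: true if name is a character class known to the
 * current LC_CTYPE locale. wctype() returns (wctype_t)0 for names that
 * the locale does not define. */
bool class_is_known(const char *name)
{
    return wctype(name) != (wctype_t)0;
}
```

An application would typically call setlocale(LC_ALL, "") first, then test a name such as "alpha" or a locale-specific one before building a pattern for fnmatch(), glob() or regcomp() around it.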
Re: More issues with pattern matching
On 27/09/2019 02:26, Robert Elz wrote:
>     Date:        Thu, 26 Sep 2019 22:58:10 +0100
>     From:        Harald van Dijk
>     Message-ID:
>
>   | 9.3.5 rule 1:
>   | "shall be followed by" is a requirement on applications, is it not?
>
> It is.
>
>   | When that requirement is violated, the regular expression or shell
>   | pattern is undefined,
>
> From where do you draw that conclusion, I see nothing to that effect.
>
> Rather, when that requirement is violated, what exists is not a character
> class (or one of the others). That is all one can conclude from that text.

Combine it with what is said about violations, which had been referenced in this thread already:

    When invalid is not used, violations of the specified syntax or semantics for REs produce undefined results: this may entail an error, enabling an extended syntax for that RE, or using the construct in error as literal characters to be matched.

Cheers,
Harald van Dijk
Re: More issues with pattern matching
    Date:        Thu, 26 Sep 2019 22:58:10 +0100
    From:        Harald van Dijk
    Message-ID:

  | 9.3.5 rule 1:
  | "shall be followed by" is a requirement on applications, is it not?

It is.

  | When that requirement is violated, the regular expression or shell
  | pattern is undefined,

From where do you draw that conclusion, I see nothing to that effect.

Rather, when that requirement is violated, what exists is not a character class (or one of the others). That is all one can conclude from that text.

kre
Re: More issues with pattern matching
On 26/09/2019 22:12, Robert Elz wrote:
> a...@gigawatt.nl said:
>   | If this is the whole pattern, then agreed, but if this is only part of the
>   | pattern, I am not sure. [[:alpha]:]] is interpreted by many shells (bash,
>   | bosh, mksh, zsh) as a character class containing an invalid character class
>   | name "alpha]".
>
> The part about the invalid class name is certainly correct, but the
> interpretation cannot be, XBD 9.3.5 page 185, lines 6136-6138:
>
>     A character class expression is expressed as a character class name
>     enclosed within bracket-<colon> ("[:" and ":]") delimiters.
>
> Since "alpha]" is not (cannot be) a character class name, we do not have
> a character class expression at all, as a character class name is required
> to exist between the delimiters for a character class expression to exist.

9.3.5 rule 1:

    The character sequences "[.", "[=", and "[:" [...]. These symbols shall be followed by a valid expression and the matching terminating sequence ".]", "=]", or ":]", as described in the following items.

"shall be followed by" is a requirement on applications, is it not? When that requirement is violated, the regular expression or shell pattern is undefined, and interpreting alpha] as an invalid character class name is a reasonable result.

Cheers,
Harald van Dijk
Re: More issues with pattern matching
    Date:        Thu, 26 Sep 2019 17:54:21 +0100
    From:        Geoff Clare
    Message-ID:  <20190926165421.GA32280@lt2.masqnet>

  | In the case of [x[:bogus:]], the use of both colons clearly indicates
  | the intention to use the new character-class feature. If the name
  | between the colons is not a valid class name, that is likely due to
  | an error on the user or application writer's part when typing the name.

I had been waiting for that argument, it is the only one that is half way rational, and supports that position. But half way is as far as it gets.

POSIX allows locales to define new char classes, it says so, XBD 7.3, page 141, lines 4218-4226.

Since a locale is allowed to define a new char class name, the shell (or regcomp() for the RE case) cannot know whether the user here:

  | For example, if a user types:
  |
  |     grep '[[:alhpa:]]' file

made a typo for alpha (the standard posix defined char class), or really intended alhpa a locale specific char class in some locale which is not the current one.

Making this some kind of error, in either REs, or shell patterns (whatever the effect of that is) makes it impossible for users to ever safely, and simply, use the locale specific locale name.

They cannot even test which locale is in use as (aside from it being impossible to be sure which locales have added this new char class to their definitions) there's no guarantee that even if we know that LC_CTYPE=EN_dislexic contains the alhpa character class, in some implementations, there is no sane way to know whether the current implementation does.

That is, unless you're requiring that before a locale specific char class can be used, the user (on the command line) or script, is required to query the locale and test whether the char class is defined there or not.

Requiring that would be absurd. Disallowing users from using locale specific char classes even though the locale is free to provide them would be absurd.

A non-absurd outcome is achieved only when unknown char class names are treated as missing empty class definitions in the current locale. That works, is easy, and clean. And yes, it means that what are really user errors cannot be trivially diagnosed, but simply produce unexpected results. But this is far from the only case where that happens - sh is a very forgiving language, vast numbers of obvious user errors are allowed to pass undiagnosed, because the shell cannot know that what the user entered is not actually what they intended to enter, and preventing genuine work in order to improve error diagnosis is not the direction that the shell has ever taken.

Regular expressions are more strictly interpreted, and do have detectable error cases, so in theory those could give errors for the case you describe (using the char class in the pattern arg to grep, for example) but the same arguments as above apply here as well, so that is not a desirable outcome. Further, even if it were, it would be difficult to achieve, given that POSIX has merged the definitions of bracket expressions in shell patterns and regular expressions into one definition, and we really want unknown char class names to be usable (if not matching anything) in shell patterns (as they don't generate error messages if invalid, they simply match different things - or sometimes nothing - which is even harder to diagnose than noticing a mistyped class name).

So to:

  | It is not absurd, it makes perfect sense.

I could not disagree more.

  | This is the same reason we added item 8,

I have no problems with item 8, and I understand it (even though I don't implement it) it simply is not relevant to anything we're currently talking about. However:

  | but POSIX was preventing them from behaving in a
  | way that is more useful to the user.

was a very good argument. So let's adopt the same one for locale defined class names, and make sure they work well, and are useful to the user.
| > So, is [[:"alpha":]] required to be treated the same as [[:alpha:]] , | > not allowed to be treated the same, explicitly unspecified, or simply | > never considered (previously) ? | | I believe the intention is that it be treated the same as [[:alpha:]]. Good, that is what I would have hoped. Now maybe we should add something to make that explicit. | This is the only reasonable conclusion if you consider the similarity to: | ls *"a"* That is a good analogy. | Clearly the intention here is that the quotes are not treated as part of | the pattern, even though pathname expansion is done before quote removal. Yes, agreed. That is what I do, and I wish everyone would do the same. An interlude before we continue: konrad.schw...@siemens.com said: | An argument for requiring [[:"alpha":]] to be the same as [[:alpha:]] | is that it would allow character-class names with white space, e.g., | "title case". That would be a nice argument, except that it isn't possible, there actually is a syntax for
Re: More issues with pattern matching
Robert Elz wrote, on 26 Sep 2019: > > | Good point. I think that this, and the behaviour I described, are > | both allowed by the standard. > > If they are, they shouldn't be. > > Before char classes, equiv classes, and collating elements were > invented, bracket expressions could contain anything (so could > patterns in general). That makes it hard to add anything new > without potentially invalidating previously valid code. > > The solution to that relies upon backet expressions being sets, > where while legal, putting an element in the set more than once > is a waste of time, and accomplishes nothing. > > That's why these new forms are defined only inside bracket expressions, > and all have the property of a duplicated character in their syntax, that > is, isn't just that [: :] looks pretty, whereas [: ] doesn't, it is the > only way to more or less safely add this new form to patterns. > > So, if we have > > [[:alpha] > > there is absolutely no question but that this is a bracket expr > that matches one of the 7 chars > [ : a l p h a > and is in no way any kind of character class reference, whatever it > looks like its author may have intended, and regardless of what comes > after it. > > If the standard says any different, or implies different, or even allows > different, it is simply wrong. > Now if this kind of "invalid char class" (invalid because the terminating > : is missing) is to not cause the bracket expression to be invalid, it is > absurd to believe that the simpler case of an unknown class name could do > so - simply absurd. It is not absurd, it makes perfect sense. In the case of [x[:bogus:]], the use of both colons clearly indicates the intention to use the new character-class feature. If the name between the colons is not a valid class name, that is likely due to an error on the user or application writer's part when typing the name. 
For example, if a user types: grep '[[:alhpa:]]' file it is much more useful if grep reports that the RE is invalid, than if it looks for a [, :, a, l, h, or p character followed by a ]. This is the same reason we added item 8, because utility implementers recognised that a user typing [:alpha:] is much more likely to be due to the user forgetting the outer brackets than intending to match one of the characters in that set, but POSIX was preventing them from behaving in a way that is more useful to the user. > | My point was that ksh93 treats [a"-"b] the same as [a-b] so trying > | to test something more specific to do with character classes in ksh93 > | is not going to yield any useful information. > > Again, sure, and again, not helpful for answering the question asked. > What buggy implementations happen to do is not really interesting. > What we want to know here is what the standard says should be done, > and perhaps also what it should say should be done. > > So, is [[:"alpha":]] required to be treated the same as [[:alpha:]] , > not allowed to be treated the same, explicitly unspecified, or simply > never considered (previously) ? I believe the intention is that it be treated the same as [[:alpha:]]. This is the only reasonable conclusion if you consider the similarity to: ls *"a"* Clearly the intention here is that the quotes are not treated as part of the pattern, even though pathname expansion is done before quote removal. > | My previous reply was based on XBD 9.3.5 item 4, but I have just spotted > | that the intro paragraph of 9.3.5 uses the word "may": > > Ater I saw your updated reply on this, which arriuved while I was composing > my previous message, I also went and looked at the standard, but I looked > at XCU 2.13.1: > The pattern bracket expression also shall match a single > collating element. > So there in the specific to the shell section, we have a "shall". Good catch. > Now both of those sections are poorly worded. 
In XBD 9.3.5 one might > interpret it as being "may" because not all bracket expressions match > collating elements, so it would be absurd to require them to do so. > > That is [abc] matches one of 'a' 'b' or 'c' and no collating elements > at all, and it would be absurd if the language in 9.3.5 required that > a specific set of multi-character collating elements shall be matched. > > Or perhaps the "may" there is as you just interpreted it, and means that > matching multi-char collating elements is optional, even when the > bracket expression is > [[=ch=]] > > Who knows? The word "may" has a strict usage. See XBD 1.5 - it "Describes a feature or behavior that is optional for an implementation that conforms to POSIX.1-2017." However, there have been cases in the past where incorrect uses "may" have been found and changed to "can". In any case, the "shall" in XCU 2.13.1 overrides it. -- Geoff Clare The Open Group, Apex Plaza, Forbury Road, Reading, RG1 1AX, England
Re: More issues with pattern matching
2019-09-26 16:28:27 +, Schwarz, Konrad:
[...]
> POSIX should disallow `:' and `]' in character class names.
[...]

While I would not disagree with that, I don't think POSIX prevents implementations from using [[:foo:]] for other things than character classes so that would not be enough to address concerns here.

In practice, some implementations support [[:<:]] as the equivalent of the standard ex utility regexp \< operator. In those implementations, that doesn't even match a character let alone collating element let alone a class of them. (note that it's only for [[:<:]], not [x[:<:]] for instance in those implementations).

One could choose to implement a [[:[<:>]:]] to match on smileys for instance :-) independently of whether there's a class by that name.

-- 
Stephane
RE: More issues with pattern matching
> -Original Message-
> From: Harald van Dijk
> Sent: Thursday, September 26, 2019 4:39 PM
> To: austin-group-l@opengroup.org
> Cc: austin-group-l@opengroup.org
> Subject: Re: More issues with pattern matching
>
> On 26/09/2019 13:13, Robert Elz wrote:
> > So, if we have
> >
> >     [[:alpha]
> >
> > there is absolutely no question but that this is a bracket expr that
> > matches one of the 7 chars
> >     [ : a l p h a
> > and is in no way any kind of character class reference, whatever it
> > looks like its author may have intended, and regardless of what comes
> > after it.
> >
> > If the standard says any different, or implies different, or even
> > allows different, it is simply wrong.
>
> If this is the whole pattern, then agreed, but if this is only part of the
> pattern, I am not sure. [[:alpha]:]]
> is interpreted by many shells (bash, bosh, mksh, zsh) as a character class
> containing an invalid character class
> name "alpha]". It may also be treated as such in ksh and yash, but as the
> whole pattern fails to match anything,
> it is hard to tell how exactly they interpret it. The interpretation as "any
> of the characters in '[:alpha',
> followed by ':]]'", is something I only see in osh and in your shell.

POSIX should disallow `:' and `]' in character class names.
RE: More issues with pattern matching
> -Original Message-
> From: Robert Elz

> So, is [[:"alpha":]] required to be treated the same as [[:alpha:]] , not
> allowed to be treated the same,
> explicitly unspecified, or simply never considered (previously) ?

An argument for requiring [[:"alpha":]] to be the same as [[:alpha:]] is that it would allow character-class names with white space, e.g., "title case".

Regards
KAS
Re: More issues with pattern matching
On 26/09/2019 11:43, Geoff Clare wrote:
> My previous reply was based on XBD 9.3.5 item 4, but I have just spotted
> that the intro paragraph of 9.3.5 uses the word "may":
>
>     A bracket expression ... is an RE that shall match a specific set of
>     single characters, and may match a specific set of multi-character
>     collating elements, ...
>
> So it appears that it is optional whether matching a bracket expression
> against more than one character is supported.

This is a relief! Multi-character collating elements in the patterns may still need to be supported, but if they are only required to match single characters in the text being matched, that is doable.

Cheers,
Harald van Dijk
Re: More issues with pattern matching
On 26/09/2019 10:20, Geoff Clare wrote:
> Geoff Clare wrote, on 26 Sep 2019:
>>> Are shells required to support this, and are shells therefore implicitly
>>> required to translate patterns to regular expressions, or should it be
>>> okay to implement this with single character support only?
>>
>> Shells are required to support it. They don't need to translate entire
>> patterns to regular expressions - they can use either regcomp()+regexec()
>> or fnmatch() to see if the bracket expression matches the next character.
>
> Sorry, I should have written "matches *at* the next character" here;
> I didn't mean to imply checking against a single character. For example,
> if using regcomp()+regexec() the shell could try to match the bracket
> expression against the remainder of the string and see how much of it
> regexec() reported as matching. To use fnmatch() I suppose you would
> have to use it in a loop, passing it first one character, then two, etc.
> (stopping at the number of characters between the '.'s).

Oh, I forgot about fnmatch(), I suppose that is generally an alternative.

Both regcomp() and fnmatch() have a problem, which is that characters cannot be escaped by a backslash as it loses its special meaning in bracket expressions, but shell quoting allows arbitrary characters to appear. What should the shell do when fed [[=".=]"=]]? How is this implementable?

Cheers,
Harald van Dijk
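The fnmatch() loop suggested in the quoted text (pass first one character, then two, and so on) might look roughly like the C sketch below. It is only an illustration under stated assumptions: bracket_prefix_lens() is a hypothetical helper name, the bracket is handed to fnmatch() exactly as written, and the sketch does nothing about the escaping problem raised above for input such as [[=".=]"=]]. It records every matching length rather than just one, since a caller may need to try more than one.

```c
#include <fnmatch.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: try a single bracket expression against the first
 * 1, 2, ... max_chars characters of text. The byte length of every prefix
 * that matches is stored in lens[] (room for max_chars entries is required)
 * and the number of stored lengths is returned. */
int bracket_prefix_lens(const char *bracket, const char *text,
                        int max_chars, size_t *lens)
{
    size_t total = strlen(text), used = 0;
    int found = 0;
    char *buf = malloc(total + 1);

    if (buf == NULL)
        return 0;
    (void)mblen(NULL, 0);                 /* reset any shift state */
    for (int n = 0; n < max_chars && used < total; n++) {
        int step = mblen(text + used, total - used);
        if (step <= 0)
            break;                        /* invalid byte sequence: give up */
        used += (size_t)step;             /* prefix grows by one character */
        memcpy(buf, text, used);
        buf[used] = '\0';
        if (fnmatch(bracket, buf, 0) == 0)
            lens[found++] = used;
    }
    free(buf);
    return found;
}
```

In the "C" locale, bracket_prefix_lens("[!x]", "dd", 2, lens) stores the single length 1. A locale with "dd" as a collating element could, depending on the C library, report 2 as well.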
Re: More issues with pattern matching
On 26/09/2019 13:13, Robert Elz wrote:
> So, if we have
>
>     [[:alpha]
>
> there is absolutely no question but that this is a bracket expr
> that matches one of the 7 chars
>     [ : a l p h a
> and is in no way any kind of character class reference, whatever it
> looks like its author may have intended, and regardless of what comes
> after it.
>
> If the standard says any different, or implies different, or even allows
> different, it is simply wrong.

If this is the whole pattern, then agreed, but if this is only part of the pattern, I am not sure. [[:alpha]:]] is interpreted by many shells (bash, bosh, mksh, zsh) as a character class containing an invalid character class name "alpha]". It may also be treated as such in ksh and yash, but as the whole pattern fails to match anything, it is hard to tell how exactly they interpret it. The interpretation as "any of the characters in '[:alpha', followed by ':]]'", is something I only see in osh and in your shell.

Cheers,
Harald van Dijk
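One scan that would produce the "alpha]" reading is the one Jörg Schilling describes elsewhere in this thread: after "[:", look for the first ":]" and take everything in between as the class name. The C sketch below shows that scan in isolation. scan_class_name() is a hypothetical name, and nothing here is a claim about how bash, bosh, mksh or zsh are actually implemented.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: p points just past a "[:" inside a bracket
 * expression. If a closing ":]" exists, store the length of the text before
 * it in *name_len and return a pointer just past the ":]". Otherwise return
 * NULL so the caller can treat the "[:" as ordinary set members. */
const char *scan_class_name(const char *p, size_t *name_len)
{
    const char *end = strstr(p, ":]");

    if (end == NULL)
        return NULL;
    *name_len = (size_t)(end - p);
    return end + 2;
}
```

For the pattern [[:alpha]:]] the scan starts at "alpha]:]]" and yields the six-character name "alpha]". Whether that name is then rejected, ignored, or looked up with wctype() is a separate decision.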
Re: More issues with pattern matching
Date: Thu, 26 Sep 2019 11:43:37 +0100
From: Geoff Clare
Message-ID: <20190926104337.GA25231@lt2.masqnet>

| Good point. I think that this, and the behaviour I described, are
| both allowed by the standard.

If they are, they shouldn't be.

Before char classes, equiv classes, and collating elements were invented, bracket expressions could contain anything (so could patterns in general). That makes it hard to add anything new without potentially invalidating previously valid code. The solution to that relies upon bracket expressions being sets, where while legal, putting an element in the set more than once is a waste of time, and accomplishes nothing. That's why these new forms are defined only inside bracket expressions, and all have the property of a duplicated character in their syntax, that is, it isn't just that [: :] looks pretty, whereas [: ] doesn't, it is the only way to more or less safely add this new form to patterns.

So, if we have [[:alpha] there is absolutely no question but that this is a bracket expr that matches one of the 7 chars [ : a l p h a and is in no way any kind of character class reference, whatever it looks like its author may have intended, and regardless of what comes after it. If the standard says any different, or implies different, or even allows different, it is simply wrong.

Now if this kind of "invalid char class" (invalid because the terminating : is missing) is to not cause the bracket expression to be invalid, it is absurd to believe that the simpler case of an unknown class name could do so - simply absurd. Either the unknown class name means that there is no character class, and all the text which looks like a character class is really just elements of the bracket expression, or the unknown class name is treated as probably being a class in some other locale, which has no members in the current locale; either could be an interpretation which makes sense (though the latter is more useful, IMO).
Invalidating the bracket expression makes no sense.

| > | XBD 9.3.5 item 8 says it is unspecified whether [:bogus:] is treated as
| > | a character class, treated as a matching list expression, or rejected
| > | as an error.
| >
| > Yes, that is unfortunate, it should be specified that an unknown (but
| > syntactically valid) class name in a character class is simply to be
| > treated as a class containing no characters,
|
| Item 8 isn't about what's between the ':'s in [[:...:]], it's about
| an RE that contains [:...:] without the outer pair of square brackets.

Sure. But as I interpreted Harald's question, to which we are attempting to reply, things that look like char classes, but are not in a bracket expression, aren't relevant (nor is 9.3.5 item 8). The question was entirely about [x[:bogus:]] and [![:bogus:]] so perhaps we should stick to answering that, and avoid deviating into side issues.

| My point was that ksh93 treats [a"-"b] the same as [a-b] so trying
| to test something more specific to do with character classes in ksh93
| is not going to yield any useful information.

Again, sure, and again, not helpful for answering the question asked. What buggy implementations happen to do is not really interesting. What we want to know here is what the standard says should be done, and perhaps also what it should say should be done. So, is [[:"alpha":]] required to be treated the same as [[:alpha:]], not allowed to be treated the same, explicitly unspecified, or simply never considered (previously)?

| My previous reply was based on XBD 9.3.5 item 4, but I have just spotted
| that the intro paragraph of 9.3.5 uses the word "may":

After I saw your updated reply on this, which arrived while I was composing my previous message, I also went and looked at the standard, but I looked at XCU 2.13.1:

    The pattern bracket expression also shall match a single collating element.

So there, in the section specific to the shell, we have a "shall".
Which means

| So it appears that it is optional whether matching a bracket expression
| against more than one character is supported.

perhaps not.

Now both of those sections are poorly worded. In XBD 9.3.5 one might interpret it as being "may" because not all bracket expressions match collating elements, so it would be absurd to require them to do so. That is, [abc] matches one of 'a' 'b' or 'c' and no collating elements at all, and it would be absurd if the language in 9.3.5 required that a specific set of multi-character collating elements shall be matched. Or perhaps the "may" there is as you just interpreted it, and means that matching multi-char collating elements is optional, even when the bracket expression is [[=ch=]]. Who knows?

XCU 2.13.1 is just as badly written, in the opposite direction. It (seems to) require every bracket expression to match a collating element. I doubt that is what it really intends
Re: More issues with pattern matching
Robert Elz wrote, on 26 Sep 2019:
>
> So, if bogus is not a valid char class for the locale (and if that is
> treated as meaning the [:...:] is not a character class element of the
> bracket expression), then the bracket expression is
> [x[:bogus:]
> where all chars between the initial '[' and the terminating ']' are
> simply literal chars. So this will match one char that is any of
> : [ b g o s u x
> and the pattern will match a word that starts with one of those 8 chars
> and is followed by a ']' char.

Good point. I think that this, and the behaviour I described, are both allowed by the standard.

> | XBD 9.3.5 item 8 says it is unspecified whether [:bogus:] is treated as
> | a character class, treated as a matching list expression, or rejected
> | as an error.
>
> Yes, that is unfortunate, it should be specified that an unknown (but
> syntactically valid) class name in a character class is simply to be
> treated as a class containing no characters,

Item 8 isn't about what's between the ':'s in [[:...:]], it's about an RE that contains [:...:] without the outer pair of square brackets. I.e. it is unspecified whether [:alpha:] is treated as [[:alpha:]], treated as [:alph], or rejected.

> | > 1b. Quoted character classes:
>
> | Some shells are known not to handle shell quoting correctly in bracket
> | expressions (in general, not specific to character classes).
>
> This issue is specific to character classes (and is subtly different
> than equivalence classes and collating symbols, as the syntax of the
> name is defined, so we know quoting is never actually required for it,
> unlike the others ... though I don't really believe that should make
> a difference).
>
> The question is whether [:"alpha":] is the same as [:alpha:] or not.

My point was that ksh93 treats [a"-"b] the same as [a-b] so trying to test something more specific to do with character classes in ksh93 is not going to yield any useful information.

> | > 2a.
Multi-character collating symbols and equivalence classes
> | >
>
> | > LANG=cy_GB.UTF-8
> | > case ch in [[=ch=]]) echo x ;; esac # none
> | > case ch in [[.ch.]]) echo x ;; esac # yash
> | > case xch in x[[=ch=]]) echo x ;; esac # yash
>
> | Shells are required to support it. They don't need to translate
> | entire patterns to regular expressions - they can use either
> | regcomp()+regexec() or fnmatch() to see if the bracket expression
> | matches the next character.

(I later corrected this to "matches at the next character")

>
> The question here relates to "next character" - in the "case ch" where
> the word being matched is "ch" is that one character, or two? A bracket
> expression matches just one, but an equivalence class may, as I understand
> it, include diphthongs (so u-umlaut and ue might be treated the same, where
> the former is one character, and the latter is two).
>
> Harald's question is whether shells are required to attempt to match
> such things, rather than just "matches the next character" ?

My previous reply was based on XBD 9.3.5 item 4, but I have just spotted that the intro paragraph of 9.3.5 uses the word "may":

    A bracket expression ... is an RE that shall match a specific set of single characters, and may match a specific set of multi-character collating elements, ...

So it appears that it is optional whether matching a bracket expression against more than one character is supported.

--
Geoff Clare
The Open Group, Apex Plaza, Forbury Road, Reading, RG1 1AX, England
Re: More issues with pattern matching
Date: Thu, 26 Sep 2019 09:49:17 +0100
From: Geoff Clare
Message-ID: <20190926084917.GA23815@lt2.masqnet>

| The key here is the way 2.13.1 words the description of '[':
|
| If an open bracket introduces a bracket expression as in XBD
| Section 9.3.5, except [...]. Otherwise, '[' shall match the
| character itself.
|
| (This wording is being improved via bug 985 but that change does not
| affect how it applies here.)
|
| If "bogus" is not a valid character class for the current locale,
| then the "If" is not satisfied and [x[:bogus:]] is treated as a
| literal [, a literal x, the bracket expression [:bogus:] and a
| literal ].

That is most certainly not what would happen. If an unknown character class for the current locale is to be treated as invalid (which I think is an unworkable specification, but might be what the standard currently allows at least) then the [:bogus:] would not be a character class. That has no effect on the bracket expression itself, except for where it terminates. A bracket expression is simply an opening '[' followed by an optional ! followed by at least one more character, and then terminating with a ']' (all unquoted). I am ignoring the effects of a leading ^ for this, that's not material here as the example has no ^ in it at all.

So, if bogus is not a valid char class for the locale (and if that is treated as meaning the [:...:] is not a character class element of the bracket expression), then the bracket expression is

    [x[:bogus:]

where all chars between the initial '[' and the terminating ']' are simply literal chars. So this will match one char that is any of

    : [ b g o s u x

and the pattern will match a word that starts with one of those 8 chars and is followed by a ']' char.

| XBD 9.3.5 item 8 says it is unspecified whether [:bogus:] is treated as
| a character class, treated as a matching list expression, or rejected
| as an error.
Yes, that is unfortunate, it should be specified that an unknown (but syntactically valid) class name in a character class is simply to be treated as a class containing no characters, and which consequently cannot match anything, so that [x[:bogus:]] simply matches 'x' and nothing else. We should, at the very least, add "future directions" that indicates that the standard will move in that direction in a later revision.

| If it is not treated as a matching list, then the "If" in
| 2.13.1 is again not satisfied and [:bogus:] is treated as a sequence
| of literal characters.

Not quite, the ']' would be the terminator for the bracket expression.

|
| > 1b. Quoted character classes:

| Some shells are known not to handle shell quoting correctly in bracket
| expressions (in general, not specific to character classes).

This issue is specific to character classes (and is subtly different than equivalence classes and collating symbols, as the syntax of the name is defined, so we know quoting is never actually required for it, unlike the others ... though I don't really believe that should make a difference).

The question is whether [:"alpha":] is the same as [:alpha:] or not. Quoting the characters a l p h a doesn't alter their interpretation anywhere else, is it reasonable for it to do so here? This is not an issue of treating special chars as literals when they are quoted, as none of them are special, though I guess if we had IFS=a we may need to quote "alpha" to avoid [[:alpha:]] being field split into 3 words before pathname expansion gets a chance to interpret it as a pattern seeking files with one char alphabetic names.

Since quote removal has not been performed at the time pathname expansion is done, and the standard says that until that happens quote characters remain in the word, some shells have interpreted this as being a request to match the class named "alpha" (literally, that is, a 7 char long name including the quote characters) which is syntactically invalid (class names cannot contain '"' characters) and therefore not a valid char class, and certainly not the same as the unquoted form. We should make it clear that is not the case (similarly [[:\alph\a:]] [[:'a'lph'a':]] and all other variations) and these are to be treated the same as the quote removed version for the purposes of looking up the class name.

| > 2a. Multi-character collating symbols and equivalence classes
| >
| > LANG=cy_GB.UTF-8
| > case ch in [[=ch=]]) echo x ;; esac # none
| > case ch in [[.ch.]]) echo x ;; esac # yash
| > case xch in x[[=ch=]]) echo x ;; esac # yash

| Shells are required to support it. They don't need to translate
| entire patterns to regular expressions - they can use either
| regcomp()+regexec() or fnmatch() to see if the bracket expression
| matches the next character.

The question here relates to "next character" - in the "case ch" where the word being matched is "ch" is that one character, or
Re: More issues with pattern matching
Geoff Clare wrote, on 26 Sep 2019: > > > Are shells required to support this, and are shells therefore implicitly > > required to translate patterns to regular expressions, or should it be okay > > to implement this with single character support only? > > Shells are required to support it. They don't need to translate > entire patterns to regular expressions - they can use either > regcomp()+regexec() or fnmatch() to see if the bracket expression > matches the next character. Sorry, I should have written "matches *at* the next character" here; I didn't mean to imply checking against a single character. For example, if using regcomp()+regexec() the shell could try to match the bracket expression against the remainder of the string and see how much of it regexec() reported as matching. To use fnmatch() I suppose you would have to use it in a loop, passing it first one character, then two, etc. (stopping at the number of characters between the '.'s). -- Geoff Clare The Open Group, Apex Plaza, Forbury Road, Reading, RG1 1AX, England
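The regcomp()+regexec() approach described above (match the bracket expression against the remainder of the string and see how much regexec() reports as matching) could be sketched as below. This is only an illustration under stated assumptions: the helper name bracket_match_len is hypothetical, the bracket expression is assumed to be in RE form already (a shell would first have to turn a leading '!' into '^' and deal with quoting), and what is reported for multi-character collating elements depends on the locale and the C library.

```c
#include <regex.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical helper: returns how many bytes at the start of `s` are
 * matched by the RE bracket expression `bracket`, 0 if it does not
 * match at the start of `s`, or -1 if the expression is too long for
 * the scratch buffer or regcomp() rejects it.
 */
long bracket_match_len(const char *bracket, const char *s)
{
    char re[256];
    regex_t preg;
    regmatch_t m[1];
    long len = 0;

    /* Anchor the bracket expression so it can only match at offset 0. */
    int n = snprintf(re, sizeof re, "^%s", bracket);
    if (n < 0 || (size_t)n >= sizeof re)
        return -1;

    if (regcomp(&preg, re, 0) != 0)
        return -1;

    /* rm_eo is the byte offset just past the matched text. */
    if (regexec(&preg, s, 1, m, 0) == 0)
        len = (long)m[0].rm_eo;

    regfree(&preg);
    return len;
}
```

A pattern matcher built on this would advance by the returned length when it is non-zero. Note that regexec() reports only one match length, so a caller that needs to consider more than one possible length for the same bracket expression cannot rely on this helper alone.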
Re: More issues with pattern matching
Harald van Dijk wrote, on 26 Sep 2019:
>
> >Eg:
> >
> >	case x in [xabc) echo x;; esac
> >
> >is not "invalid" because the "bracket expression" has no terminating ']',
> >rather it simply has no bracket expression at all, and fails to match
> >here because it only matches the literal string '[xabc'.
>
> This is a special exception, a deviation from the regular expression syntax,
> see 2.13.1:
>
> >If an open bracket introduces a bracket expression as in XBD RE Bracket
> >Expression, except that the <exclamation-mark> character ( '!' ) shall
> >replace the <circumflex> character ( '^' ) in its role in a non-matching
> >list in the regular expression notation, it shall introduce a pattern
> >bracket expression. A bracket expression starting with an unquoted
> ><circumflex> character produces unspecified results. Otherwise, '[' shall
> >match the character itself.
>
> No such exception has been written for character classes and collating
> elements.

The reference to XBD RE Bracket Expression (9.3.5) applies to the whole of 9.3.5, which includes the descriptions of what constitutes a valid character class or collating element.

--
Geoff Clare
The Open Group, Apex Plaza, Forbury Road, Reading, RG1 1AX, England
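The "Otherwise, '[' shall match the character itself" rule quoted above can be observed through fnmatch(), which uses the same pattern notation. The following is a small sketch rather than a statement about any particular shell; the wrapper name glob_matches is hypothetical.

```c
#include <fnmatch.h>

/*
 * Hypothetical wrapper: returns 1 if `word` matches the shell-style
 * pattern `pattern` according to fnmatch(), 0 otherwise.
 */
int glob_matches(const char *pattern, const char *word)
{
    return fnmatch(pattern, word, 0) == 0;
}
```

With a C library that follows the 2.13.1 wording, the unterminated pattern "[xabc" should match only the literal string "[xabc" and not "x", whereas the terminated "[xabc]" matches "x".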
Re: More issues with pattern matching
Harald van Dijk wrote, on 25 Sep 2019: > > After comparing what my shell does now during pattern matching to what it > should, I found a few more cases where I do not believe POSIX is clear about > what is required and where shells are not in agreement. These are not > related to the backslash handling. This isn't a complete response to all the points - I'm just noting some things that I don't think other responders have mentioned. > 1a. Invalid character classes: > > case x in [x[:bogus:]]) echo x ;; esac # bash,bosh,mksh,nbsh,osh,zsh > case x in [![:bogus:]]) echo x ;; esac # above except osh > > The handling of this in dash, inherited by my shell, is just buggy and > should be ignored. > > In bash, bosh, mksh, nbsh, zsh, a character does not match an invalid > character class. In osh, a character neither matches nor fails to match an > invalid character class, but the pattern is still valid. In yash, the use of > [:bogus:] renders the whole pattern invalid. > > These all seem reasonable choices. regcomp() would reject the whole pattern > as an error, and character classes are supposed to behave as they do in > regular expressions, so I believe yash's behaviour makes the most sense. Is > that correct? The key here is the way 2.13.1 words the description of '[': If an open bracket introduces a bracket expression as in XBD Section 9.3.5, except [...]. Otherwise, '[' shall match the character itself. (This wording is being improved via bug 985 but that change does not affect how it applies here.) If "bogus" is not a valid character class for the current locale, then the "If" is not satisfied and [x[:bogus:]] is treated as a literal [, a literal x, the bracket expression [:bogus:] and a literal ]. XBD 9.3.5 item 8 says it is unspecified whether [:bogus:] is treated as a character class, treated as a matching list expression, or rejected as an error. 
If it is not treated as a matching list, then the "If" in 2.13.1 is again not satisfied and [:bogus:] is treated as a sequence of literal characters. > 1b. Quoted character classes: > > Shells agree that quoting disables the recognition of character classes, but > they disagree on how much quoting disables it. > > case x in ["[:alnum:]"]) echo x ;; esac # none > case x in [[:"alnum:]"]) echo x ;; esac # none > case x in [[:"alnum:"]]) echo x ;; esac # ksh, mksh, yash, zsh > case x in [[:\alnum:]]) echo x ;; esac # above plus osh > case x in [[:"alnum":]]) echo x ;; esac # above plus dash, nbsh > > I believe that as the special characters to indicate a character class are > "[:" and ":]", the osh behaviour is correct, the character class name is > allowed to be quoted. Is that correct? The dash/nbsh behaviour, again > inherited by my shell, is close, but the fact that the type of quoting > affects how the character class is treated looks like a bug. Some shells are known not to handle shell quoting correctly in bracket expressions (in general, not specific to character classes). I think this came to light during discussion of bug 1190. I seem to recall ksh93 being the main culprit, but other shells may have had bugs as well. > 2. Collating symbols and equivalence classes > > Collating symbols and equivalence classes are less widely implemented. > > case x in [[.x.]]) echo x ;; esac # bash, ksh, mksh, osh, yash > case x in [[=x=]]) echo x ;; esac # same > case ä in [[=a=]]) echo x ;; esac # bash, ksh, yash > case a in [[=ä=]]) echo x ;; esac # same > > The handling of brackets in pattern matching is defined by reference to RE > Bracket Expression and no exception has been made for them, so these are > supposed to be handled in pattern matching as well. > > 2a. Multi-character collating symbols and equivalence classes > > Multi-character support seems impossible to implement portably other than by > translating patterns to regular expressions as yash does. 
POSIX does not
> provide any other means to ask the implementation enough information about
> what is supported in the current locale. And when things do get translated
> to regular expressions, it relies on libc support, with glibc behaving
> strangely, but this may just be my limited understanding of how things are
> supposed to work.
>
> LANG=cy_GB.UTF-8
> case ch in [[=ch=]]) echo x ;; esac # none
> case ch in [[.ch.]]) echo x ;; esac # yash
> case xch in x[[=ch=]]) echo x ;; esac # yash
>
> Are shells required to support this, and are shells therefore implicitly
> required to translate patterns to regular expressions, or should it be okay
> to implement this with single character support only?

Shells are required to support it. They don't need to translate entire patterns to regular expressions - they can use either regcomp()+regexec() or fnmatch() to see if the bracket expression matches the next character.

> 2b. Invalid collating elements
>
> As with invalid character classes:
>
> case x in [x[.xy.]]) echo x ;; esac # bash,
Re: More issues with pattern matching
On 26/09/2019 02:36, Harald van Dijk wrote: POSIX mentions the possibility of locale-specific character classes and they are required to be recognised in regular expressions and therefore in shell glob patterns: In addition, character class expressions of the form: [:name:] are recognized in those locales where the name keyword has been given a charclass definition in the LC_CTYPE category. I have not checked which shells implement this correctly. I know mine does not. I was assuming a locale that does not define [:bogus:] as a character class, but should have specified. I meant to include here that under a locale that does not define [:bogus:] as a character class, I expect [[:bogus:]] to silently not match anything, like you. I referred back to it later, but forgot to actually write it. Cheers, Harald van Dijk
Re: More issues with pattern matching
On 26/09/2019 01:47, Robert Elz wrote:
> Date: Wed, 25 Sep 2019 22:29:36 +0100
> From: Harald van Dijk
> Message-ID:
>
> | These all seem reasonable choices. regcomp() would reject the whole
> | pattern as an error, and character classes are supposed to behave as
> | they do in regular expressions, so I believe yash's behaviour makes the
> | most sense. Is that correct?
>
> Character classes (and bracket expressions) behave like they do in regcomp(), but in glob patterns there is no such thing as "invalid", patterns that one might assume to be invalid are in reality patterns that match something different from what might appear to have been intended at first glance.

The possibility of invalid patterns is explicitly acknowledged in the description of patterns ending with an unescaped backslash:

    If a pattern ends with an unescaped <backslash>, it is unspecified whether the pattern does not match anything or the pattern is treated as invalid.

However, yash treats the patterns I described as not matching anything, it does not raise any errors for them. I had not considered the possibility of shells raising errors for them and I agree that that is not desirable.

> So, if one were to decide that [:bogus:] is not a valid character class, as the name is not valid -- which I think would be a truly poor choice, as locales are free to define new character classes, and this approach would make it impossible to ever safely attempt to use such a class ... eg: some languages might (I have no idea if they do or not) define a character class tonemark (with whatever spelling) to match the "characters" that indicate the "tone" (which I kind of understand, but am unable to describe) that is to be used (in Thai there are I believe 5 different words that all sound like "ma" with different tones, and wildly different meanings, dog, horse, come (I don't know the other two) they are pronounced with different tones (high, low, rising, falling and normal)). I think there are 7 tones in Vietnamese.
> There are glyphs that are written above the consonant (the 'm' in this case - the Thai 'm' obviously) that indicate which tone (actually take the base tone implied by the consonant in question and modify it, rather than being absolute). Those glyphs are present in written text as a character following the consonant. It would be entirely reasonable for a script to look for something like (using Thai chars for the 'm' and 'a' of course) m[[:tonemark:]]a to match any of those words (except the one that has no mark, never mind, we'd need ma | m[[:tonemark:]]a ).
>
> If we were to treat this as invalid, that is, generate an error (and would it be a "compile time" error, or execution time?) just because we happen to be in some non tone mark using locale, rather than simply not matching, things get very difficult.

POSIX mentions the possibility of locale-specific character classes and they are required to be recognised in regular expressions and therefore in shell glob patterns:

    In addition, character class expressions of the form: [:name:] are recognized in those locales where the name keyword has been given a charclass definition in the LC_CTYPE category.

I have not checked which shells implement this correctly. I know mine does not. I was assuming a locale that does not define [:bogus:] as a character class, but should have specified.

> But if that were to happen and the character class is invalid, then the bracket expression is [[:tonemark:] and is the set of any of the characters '[' ':' 't' 'o' ... 'r' 'k' (and an extra redundant ':' - since it is a set, duplicates are ignored just as they are in [aaa]). The pattern above would be matched by a word that contains an 'm' followed by one from that set, followed by a ']' followed by an 'a'. That's even worse than generating an error. Character classes really must be treated as character classes, regardless of whether they're recognised in the current locale or not. This should not be unspecified.
Agreed, regardless of what POSIX currently says. This is what I was referring to with "The handling of this in dash, inherited by my shell, is just buggy and should be ignored." I intend to fix this, but am not sure of the details yet.

> [...]
> | As with invalid character classes:
> |
> |	case x in [x[.xy.]]) echo x ;; esac # bash, ksh, mksh
> |
> | This would be rejected with an error by regcomp(), so rejecting the
> | whole pattern makes most sense to me.
>
> Same as above, no shell pattern is ever rejected, no matter what.

Same as above, by "rejected" I meant that it is treated as a never-matching pattern.

> Eg:
>
>	case x in [xabc) echo x;; esac
>
> is not "invalid" because the "bracket expression" has no terminating ']', rather it simply has no bracket expression at all, and fails to match here because it only matches the literal string '[xabc'.

This is a special exception, a deviation from the regular expression
Re: More issues with pattern matching
Date: Wed, 25 Sep 2019 22:29:36 +0100
From: Harald van Dijk
Message-ID:

| These all seem reasonable choices. regcomp() would reject the whole
| pattern as an error, and character classes are supposed to behave as
| they do in regular expressions, so I believe yash's behaviour makes the
| most sense. Is that correct?

Character classes (and bracket expressions) behave like they do in regcomp(), but in glob patterns there is no such thing as "invalid", patterns that one might assume to be invalid are in reality patterns that match something different from what might appear to have been intended at first glance.

So, if one were to decide that [:bogus:] is not a valid character class, as the name is not valid -- which I think would be a truly poor choice, as locales are free to define new character classes, and this approach would make it impossible to ever safely attempt to use such a class ... eg: some languages might (I have no idea if they do or not) define a character class tonemark (with whatever spelling) to match the "characters" that indicate the "tone" (which I kind of understand, but am unable to describe) that is to be used (in Thai there are I believe 5 different words that all sound like "ma" with different tones, and wildly different meanings, dog, horse, come (I don't know the other two) they are pronounced with different tones (high, low, rising, falling and normal)). I think there are 7 tones in Vietnamese. There are glyphs that are written above the consonant (the 'm' in this case - the Thai 'm' obviously) that indicate which tone (actually take the base tone implied by the consonant in question and modify it, rather than being absolute). Those glyphs are present in written text as a character following the consonant. It would be entirely reasonable for a script to look for something like (using Thai chars for the 'm' and 'a' of course) m[[:tonemark:]]a to match any of those words (except the one that has no mark, never mind, we'd need ma | m[[:tonemark:]]a ).
If we were to treat this as invalid, that is, generate an error (and would it be a "compile time" error, or execution time?) just because we happen to be in some non tone mark using locale, rather than simply not matching, things get very difficult.

But if that were to happen and the character class is invalid, then the bracket expression is [[:tonemark:] and is the set of any of the characters '[' ':' 't' 'o' ... 'r' 'k' (and an extra redundant ':' - since it is a set, duplicates are ignored just as they are in [aaa]). The pattern above would be matched by a word that contains an 'm' followed by one from that set, followed by a ']' followed by an 'a'. That's even worse than generating an error. Character classes really must be treated as character classes, regardless of whether they're recognised in the current locale or not. This should not be unspecified.

Ignoring "errors" is the way glob matching has always worked, something that cannot be interpreted as what it appears to be simply means something different. Regular expressions are different, they are more formally defined, and have always had valid, and invalid cases - so what regcomp() happens to do isn't really relevant to what happens with shell patterns. They are entirely different beasts.

| 1b. Quoted character classes:
|
| I believe that as the special characters to indicate a character class
| are "[:" and ":]", the osh behaviour is correct, the character class
| name is allowed to be quoted.

I agree, there is nothing that suggests it should be otherwise.

| The dash/nbsh behaviour,
| again inherited by my shell, is close, but the fact that the type of
| quoting affects how the character class is treated looks like a bug.

I would agree with that, and for nbsh I will take a look (I haven't yet verified this, but I certainly believe it could be that way) and fix it.

| 2. Collating symbols and equivalence classes
|
| Collating symbols and equivalence classes are less widely implemented.
nbsh is one that doesn't implement them at all. That's a defect, they should be supported, but fixing this is kind of low priority as in practice, nothing seems to use them (other than tests to see if they are implemented) so don't necessarily expect a resolution any time soon (it isn't just the shell, I am not sure if there's any support for those in NetBSD's locale system at all). | As with invalid character classes: | |case x in [x[.xy.]]) echo x ;; esac # bash, ksh, mksh | | This would be rejected with an error by regcomp(), so rejecting the | whole pattern makes most sense to me. Same as above, no shell pattern is ever rejected, no matter what. Eg: case x in [xabc) echo x;; esac is not "invalid" because the "bracket expression" has no terminating ']', rather it simply has no bracket expression at all, and fails to match here because it only matches the literal
Re: More issues with pattern matching
On 26/09/2019 00:18, Shware Systems wrote:
> While it may not be mentioned in that thread, P182, L6005 explicitly has the blanket "violations ... produce undefined results" that I see could apply for bogus names. It would be a semantics error more than a syntax one, but the language is there. However, 9.3.5 could also be construed as all other names represent an empty set to be unioned with the set of elements for 9.3.5, 6b. as the class checked, and if an implementation does not provide any global set (since locale definitions have no means to define a per locale set) then no match is the required behavior. If a global set is provided, a match may occur, but either way all values for name are potentially valid.

This looks like a contradiction between the specification of regular expressions and the specification of regcomp(). I agree that the specification for regular expressions does not say that unrecognised names are "invalid" and that the failure to say so renders the results undefined. At the same time, regcomp() says:

    The following constants are defined as the minimum set of error return values, although other errors listed as implementation extensions in <regex.h> are possible:

    REG_ECTYPE    Invalid character class type referenced.

An implementation that allows any bogus name by either not parsing it as a character class or by parsing it as a character class that never matches any character would never have regcomp() return REG_ECTYPE, but by stating that the minimum set of error return values includes REG_ECTYPE, the standard requires regcomp() to return REG_ECTYPE for at least some patterns.

Cheers,
Harald van Dijk
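What a given C library actually does with an unrecognised class name can be checked directly. The sketch below is an illustration only: the helper name regcomp_status is hypothetical, and the result for "[[:bogus:]]" is exactly the point in dispute, so the code makes no assumption about which value comes back.

```c
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper: compile `re` as a basic regular expression and
 * return regcomp()'s result code (0 on success, otherwise one of the
 * REG_* error constants such as REG_ECTYPE).
 */
int regcomp_status(const char *re)
{
    regex_t preg;
    int rc = regcomp(&preg, re, REG_NOSUB);

    /* Only a successfully compiled expression needs to be freed. */
    if (rc == 0)
        regfree(&preg);
    return rc;
}
```

Comparing regcomp_status("[[:alpha:]]") with regcomp_status("[[:bogus:]]") in a locale that does not define a "bogus" class shows whether the library reports REG_ECTYPE or accepts the name as a class that matches nothing.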
Re: More issues with pattern matching
On 25/09/2019 22:49, Stephane Chazelas wrote: 2019-09-25 22:29:36 +0100, Harald van Dijk: [...] 1a. Invalid character classes: case x in [x[:bogus:]]) echo x ;; esac # bash,bosh,mksh,nbsh,osh,zsh case x in [![:bogus:]]) echo x ;; esac # above except osh [...] See also https://www.mail-archive.com/austin-group-l%40opengroup.org/msg02247.html (and the rest of that thread). Thanks. This does not cover all of my questions, but does cover some of them. I agree with Robert Elz's comment there: I truly dislike that kind of approach in the standard - particularly if it is deliberate. Readers of the text don't know that it is actually unspecified, as it might be specified somewhere else they haven't found yet. The fact that character classes in patterns are defined by reference to regular expressions, and in regular expressions they render the whole regular expression invalid per regcomp()'s REG_ECTYPE error return value, does not appear to be mentioned in that thread. This may change the conclusion that the behaviour is implicitly unspecified. Cheers, Harald van Dijk
Re: More issues with pattern matching
2019-09-25 22:29:36 +0100, Harald van Dijk: [...] > 1a. Invalid character classes: > > case x in [x[:bogus:]]) echo x ;; esac # bash,bosh,mksh,nbsh,osh,zsh > case x in [![:bogus:]]) echo x ;; esac # above except osh [...] See also https://www.mail-archive.com/austin-group-l%40opengroup.org/msg02247.html (and the rest of that thread). -- Stephane