Re: More issues with pattern matching

Geoff Clare Thu, 26 Sep 2019 03:44:09 -0700

Robert Elz <k...@munnari.oz.au> wrote, on 26 Sep 2019:
>
> So, if bogus is not a valid char class for the locale (and if that is
> treated as meaning the [:...:] is not a character class element of the
> bracket expression), then the bracket expression is
>       [x[:bogus:]
> where all chars between the initial '[' and the terminating ']' are
> simply literal chars.   So this will match one char that is any of
>       : [ b g o s u x
> and the pattern will match a word that starts with one of those 8 chars
> and is followed by a ']' char.


Good point.  I think that this, and the behaviour I described, are
both allowed by the standard.
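
A quick way to see which reading a given shell takes is a probe along
these lines.  This is only a sketch: the function name probe_bogus is
made up for illustration, and the output is deliberately not stated
because it depends on the shell (and possibly the locale).

```shell
# Hypothetical helper (not from the thread): report whether the word in $1
# matches the pattern [x[:bogus:]] in the shell running this script.
probe_bogus() {
    case $1 in
        [x[:bogus:]]) echo "match" ;;
        *)            echo "no match" ;;
    esac
}

# Expected to match only under the "all literal" reading, where the
# bracket expression is [x[:bogus:] and is followed by a literal ']'.
probe_bogus 'b]'

# Expected to match if [:bogus:] is taken as a class containing no
# characters, so that the bracket expression matches just 'x'.
probe_bogus 'x'
```

A shell that reports "no match" for both words is presumably rejecting
the bracket expression altogether.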

>   | XBD 9.3.5 item 8 says it is unspecified whether [:bogus:] is treated as
>   | a character class, treated as a matching list expression, or rejected
>   | as an error.
> 
> Yes, that is unfortunate, it should be specified that an unknown (but
> syntactically valid) class name in a character class is simply to be
> treated as a class containing no characters,

Item 8 isn't about what's between the ':'s in [[:...:]], it's about
an RE that contains [:...:] without the outer pair of square brackets.
I.e. it is unspecified whether [:alpha:] is treated as [[:alpha:]],
treated as [:alph], or rejected.
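
The two non-error readings can be told apart with a pair of words that
each match under only one of them.  The sketch below assumes the same
latitude carries over from REs to shell patterns; probe_item8 is a
made-up name, and no output is shown because the result is unspecified.

```shell
# Hypothetical helper (not from the thread): report whether the word in $1
# matches the pattern [:alpha:] written without the outer brackets.
probe_item8() {
    case $1 in
        [:alpha:]) echo "match" ;;
        *)         echo "no match" ;;
    esac
}

# 'b' is alphabetic but is not one of ':', 'a', 'l', 'p', 'h', so it
# should match only if the pattern is treated as [[:alpha:]].
probe_item8 b

# ':' is in the list but is not alphabetic, so it should match only if
# the pattern is treated as the matching list [:alph].
probe_item8 :
```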

>   | > 1b. Quoted character classes:
> 
>   | Some shells are known not to handle shell quoting correctly in bracket
>   | expressions (in general, not specific to character classes).
> 
> This issue is specific to character classes (and is subtly different
> than equivalence classes and collating symbols, as the syntax of the
> name is defined, so we know quoting is never actually required for it,
> unlike the others ... though I don't really believe that should make
> a difference).
> 
> The question is whether [:"alpha":] is the same as [:alpha:] or not.

My point was that ksh93 treats [a"-"b] the same as [a-b] so trying
to test something more specific to do with character classes in ksh93
is not going to yield any useful information.
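
The general quoting behaviour can be checked before looking at anything
specific to character classes.  A minimal sketch follows; the name
probe_quoted_hyphen is made up here, and the result varies by shell.

```shell
# Hypothetical helper (not from the thread): report whether the word in $1
# matches [a"-"b], where the hyphen inside the brackets is quoted.
probe_quoted_hyphen() {
    case $1 in
        [a"-"b]) echo "match" ;;
        *)       echo "no match" ;;
    esac
}

# A shell that honours the quoting sees the list 'a', '-', 'b' and should
# report a match for '-'.  A shell that treats the pattern like [a-b]
# sees a range, which is not expected to contain '-'.
probe_quoted_hyphen -
```

If this reports "no match", results from quoting tests on character
class names in the same shell say little about character classes as such.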

>   | > 2a. Multi-character collating symbols and equivalence classes
>   | > 
> 
>   | >   LANG=cy_GB.UTF-8
>   | >   case  ch in  [[=ch=]]) echo x ;; esac # none
>   | >   case  ch in  [[.ch.]]) echo x ;; esac # yash
>   | >   case xch in x[[=ch=]]) echo x ;; esac # yash
> 
>   | Shells are required to support it.  They don't need to translate
>   | entire patterns to regular expressions - they can use either
>   | regcomp()+regexec() or fnmatch() to see if the bracket expression
>   | matches the next character.
(I later corrected this to "matches at the next character")
> 
> The question here relates to "next character" - in the "case ch" where
> the word being matched is "ch" is that one character, or two?   A bracket
> expression matches just one, but an equivalence class may, as I understand
> it, include diphthongs (so u-umlaut and ue might be treated the same, where
> the former is one character, and the latter is two).
> 
> Harald's question is whether shells are required to attempt to match
> such things, rather than just "matches the next character" ?

My previous reply was based on XBD 9.3.5 item 4, but I have just spotted
that the intro paragraph of 9.3.5 uses the word "may":

    A bracket expression ... is an RE that shall match a specific set
    of single characters, and may match a specific set of
    multi-character collating elements, ...

So it appears that it is optional whether matching a bracket expression
against more than one character is supported.

-- 
Geoff Clare <g.cl...@opengroup.org>
The Open Group, Apex Plaza, Forbury Road, Reading, RG1 1AX, England
