Re: More issues with pattern matching

Geoff Clare Fri, 27 Sep 2019 02:39:59 -0700

Robert Elz <k...@munnari.oz.au> wrote, on 27 Sep 2019:
>
>   | In the case of [x[:bogus:]], the use of both colons clearly indicates
>   | the intention to use the new character-class feature.  If the name
>   | between the colons is not a valid class name, that is likely due to
>   | an error on the user or application writer's part when typing the name.
> 
> I had been waiting for that argument, it is the only one that is
> half way rational, and supports that position.  But half way is as
> far as it gets.
> 
> POSIX allows locales to define new char classes, it says so, XBD 7.3,
> page 141, lines 4218-4226.
> 
> Since a locale is allowed to define a new char class name, the shell
> (or regcomp() for the RE case) cannot know whether the user here:
> 
>   | For example, if a user types:
>   |
>   | grep '[[:alhpa:]]' file
> 
> made a typo for alpha (the standard posix defined char class), or really
> intended alhpa a locale specific char class in some locale which is not
> the current one.
> 
> Making this some kind of error, in either REs, or shell patterns (whatever
> the effect of that is) makes it impossible for users to ever safely, and
> simply, use the locale specific locale name.
> 
> They cannot even test which locale is in use as (aside from it being
> impossible to be sure which locales have added this new char class to
> their definitions) there's no guarantee that even if we know that
> LC_CTYPE=EN_dislexic contains the alhpa character class, in some
> implementations, there is no sane way to know whether the current
> implementation does.
> 
> That is, unless you're requiring that before a locale specific char class
> can be used, the user (on the command line) or script, is required to
> query the locale and test whether the char class is defined there or not.
> 
> Requiring that would be absurd.


You might consider it absurd, but it is what the standard requires
applications and users to do in order to avoid "undefined results"
(as per XBD 9.1 under "invalid").  The standard even acknowledges that
applications need to be able to do that, in the APPLICATION USAGE for
the locale utility:

    Implementations are not required to write out the actual values
    for keywords in the categories LC_CTYPE and LC_COLLATE; however,
    they must write out the categories (allowing an application to
    determine, for example, which character classes are available).

In a C program, finding out if a character class name is valid for the
current locale is simply a matter of calling wctype(name) and checking
whether it returns (wctype_t)0.  So applications using fnmatch(), glob()
or regcomp() can do that before using a name that isn't one of the
mandated ones.
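
A minimal sketch of that check, in case it helps. The helper names
class_is_defined() and compile_class_re() are made up for illustration,
and the 64-byte pattern buffer is an arbitrary assumption. Note that
wctype() consults the current LC_CTYPE locale, so a program that wants
the user's locale rather than "C" has to call setlocale(LC_ALL, "")
first.

```c
#include <regex.h>
#include <stdbool.h>
#include <stdio.h>
#include <wctype.h>

/* True if NAME is a character class name in the current LC_CTYPE
 * locale.  wctype() returns (wctype_t)0 for an invalid name. */
static bool class_is_defined(const char *name)
{
    return wctype(name) != (wctype_t)0;
}

/* Hypothetical helper: compile the RE "[[:NAME:]]" only if NAME is a
 * valid class here, so that regcomp() is never handed an invalid
 * class name.  Returns true on success; the caller must then
 * regfree(re). */
static bool compile_class_re(regex_t *re, const char *name)
{
    char pattern[64];   /* arbitrary size, enough for short class names */
    int n;

    if (!class_is_defined(name))
        return false;   /* unknown class in this locale */

    n = snprintf(pattern, sizeof pattern, "[[:%s:]]", name);
    if (n < 0 || (size_t)n >= sizeof pattern)
        return false;   /* name too long for the buffer */

    return regcomp(re, pattern, REG_NOSUB) == 0;
}
```

With this, compile_class_re(&re, "alpha") should succeed in any
conforming locale, while a name such as "alhpa" is rejected up front
unless the current locale really defines it.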

>   | > So, is [[:"alpha":]] required to be treated the same as [[:alpha:]] ,
>   | > not allowed to be treated the same, explicitly unspecified, or simply
>   | > never considered (previously) ?
>   |
>   | I believe the intention is that it be treated the same as [[:alpha:]].
> 
> Good, that is what I would have hoped.   Now maybe we should add something
> to make that explicit.

Yes, I think an addition is warranted. Maybe we should add a new
paragraph to 2.13 (before the 2.13.1 heading) along the lines of:

    In the shell, any quoting characters (see [xref to 2.2]) that are
    present in a word to be used as a pattern, and are treated as
    special, shall participate in pattern matching only through their
    effects on other characters; they shall not themselves be treated
    as pattern characters. For example:

    ls -ld \\*

    lists files with names that begin with a single <backslash>,

    ls -ld "?"*

    lists files with names that begin with a <question-mark>,

    ls -ld [[:'alpha':]]*

    lists files with names that begin with an alphabetic character in
    the current locale, and

    ls -ld [[':alpha:']]*

    lists files with names that begin with a character from the set
    { '[', ':', 'a', 'l', 'p', 'h' } followed by a ']'.

> 
>   | The word "may" has a strict usage.  See XBD 1.5 - it "Describes a
>   | feature or behavior that is optional for an implementation that
>   | conforms to POSIX.1-2017."
>   |
>   | However, there have been cases in the past where incorrect uses of "may"
>   | have been found and changed to "can".
>   |
>   | In any case, the "shall" in XCU 2.13.1 overrides it.
> 
> Only for shell patterns, we still need to decide whether it was the
> defined "may" or an erroneous use which should be replaced by "can"
> for regular expressions.   Given the shell imperative, and the desire
> to make bracket expressions in sh patterns and REs as equivalent as
> possible, I suspect the latter.

The rationale in XRAT says the opposite (A.9.1):

    The ISO POSIX-2:1993 standard required bracket expressions like
    "[^[:lower:]]" to match multi-character collating elements such as
    "ij". However, this requirement led to behavior that many users
    did not expect and that could not feasibly be mimicked in user
    code, and it was rarely if ever implemented correctly. The current
    standard leaves it unspecified whether a bracket expression
    matches a multi-character collating element, allowing both
    historical and ISO POSIX-2:1993 standard implementations to
    conform.

This rationale would have been added at the same time the word "may" was
introduced in the normative text.  So the "shall" in XCU 2.13.1 appears
to be something that was overlooked when this change was made.

-- 
Geoff Clare <g.cl...@opengroup.org>
The Open Group, Apex Plaza, Forbury Road, Reading, RG1 1AX, England
