I still find the resolution Bug1857 not satisfactory. (topic: regex)

Niu Danny via austin-group-l at The Open Group Mon, 03 Mar 2025 22:26:06 -0800

While many ISO standards are specified in terms of
performance and quality metrics, IT standards like
ours are for promoting interoperability, so the interaction
between applications and implementations needs to be considered.


To better understand the problem I have, and that we may be facing,
let's consider a regular expression implementation where:

1. the first match is found in the same manner as under
   Perl/PHP/Python/PCRE semantics, where quantifiers
   start with their "best" (Perl terminology) value,

2. the implementation adjusts the quantifiers from right to left
   to discover potentially longer matches in cases where minimal
   quantifiers would otherwise yield shorter matches,

3. step 2 is repeated until all combinations are exhausted, and then
   the best (in the POSIX sense, i.e. longest) match is returned.
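
The outcome of such an implementation can be approximated with a short
Python sketch. This is only an illustration under stated assumptions: it
uses Python's `re` module as the Perl-style backtracking engine, the helper
name `longest_prefix_match` is hypothetical, and it emulates the result of
exhausting all combinations rather than the right-to-left adjustment order
itself.

```python
import re

def longest_prefix_match(pattern, subject):
    """Return the longest prefix of `subject` that `pattern` can match
    under any combination of quantifier values, or None if there is none.
    (Hypothetical helper, for illustration only.)"""
    compiled = re.compile(pattern)
    # Try candidate end positions from longest to shortest. fullmatch()
    # makes the backtracking engine keep adjusting quantifiers until some
    # combination ends exactly at `end`, or all combinations are exhausted.
    for end in range(len(subject), -1, -1):
        if compiled.fullmatch(subject, 0, end):
            return subject[:end]
    return None

# Step 1: the first match found by the Perl-style engine.
first_found = re.match(r"(aaa??)*", "aaaaa").group(0)
# Steps 2 and 3: the longest match over all quantifier combinations.
exhaustive = longest_prefix_match(r"(aaa??)*", "aaaaa")

print(first_found)  # aaaa
print(exhaustive)   # aaaaa
```

For this pattern the exhaustive search reaches all five characters only by
letting "a??" match one character, which is the kind of longer match that
step 2 is meant to discover.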

A summary of problems/questions I have:

----

With the current resolution of Bug-1857
<http://www.austingroupbugs.net/view.php?id=1857#c6881>, we have:

> Consistent with the match for the entire 
> regular expression being the leftmost and 
> longest for which any minimal repetitions 
> used in the match have the shortest possible 
> match, 

Q1: does the "for which" clause imply that
    if there are any minimal quantifiers,
    the overall match may *Not Necessarily* be 
    the longest?

The examples from the previous paragraph seem to confirm this:

> However, the ERE "(aaa??)*" matches only 
> the first four characters of the string "aaaaa", 
> not all five, because in order to match all five, 
> "a??" would match with length one instead of zero; 
> the ERE "(aaa??)*|(aaa?)*" matches all five because 
> the longest match is one which does not use 
> any minimal repetitions.

In which case, I think the length of the overall match 
is ambiguous.
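
For comparison, here is what a first-match backtracking engine reports for
the expressions in the quoted examples. This is a minimal sketch that
assumes Python's `re` module as a stand-in for Perl/PCRE-style semantics;
it does not implement the POSIX leftmost-longest rule, so it shows only one
side of the comparison.

```python
import re

subject = "aaaaa"
patterns = [
    r"(aaa??)*",          # minimal repetition inside a greedy one
    r"(aaa??)*|(aaa?)*",  # the alternation from the quoted example
    r"(aaa?)*",           # greedy-only alternative on its own
]

# Record the text matched at the start of the subject for each pattern.
results = {p: re.match(p, subject).group(0) for p in patterns}

for pattern, matched in results.items():
    print(f"{pattern!r:24} -> {matched!r} (length {len(matched)})")
```

With this engine the alternation stops at four characters, because the
first alternative already succeeds, whereas the quoted text expects all
five characters to be matched.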

----

> each BRE or ERE in a concatenated set, 
> from left to right, shall match the longest 
> possible string for which any minimal repetitions 
> used in the match for that BRE or ERE have 
> the shortest possible match.

Q2: are the said BRE and ERE parenthesized?

It is mentioned in a bug note (from @geoffclare):
<http://www.austingroupbugs.net/view.php?id=1857#c6890>

----

> There is certainly no intention to require 
> the '?' modifier to act recursively, and 
> I can't see any way to interpret my suggested 
> wording as implying it.

Q3: How can it simultaneously:

- not act recursively,
- match the shortest subject string when
  it's applied to a parenthesized subexpression
  with a greedy quantifier in it?

e.g. `([0-9]+)+?`
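
To make the tension concrete, the following sketch shows how a Perl-style
backtracking engine treats this expression. It assumes Python's `re` module
and the subject string "12345", both chosen here purely for illustration;
it says nothing about what a POSIX-conforming implementation must do.

```python
import re

subject = "12345"

# Minimal outer quantifier, greedy inner quantifier: the outer "+?" settles
# for one iteration, but that iteration's greedy "[0-9]+" takes every digit.
non_recursive = re.match(r"([0-9]+)+?", subject).group(0)

# What "acting recursively" would amount to: the inner quantifier is made
# minimal as well, so a single digit is matched.
recursive_like = re.match(r"([0-9]+?)+?", subject).group(0)

print(non_recursive)   # 12345
print(recursive_like)  # 1
```

Under these semantics the minimal quantifier does not act recursively, and
the match is therefore not the shortest possible subject string.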

----

Observation 1:

@geoffclare replying to @dannyniu:
<http://www.austingroupbugs.net/view.php?id=1857#c6883>

>> if both greedy **AND** lazy quantifiers're nested ...

> That was the reason for wording it as "longest 
> possible ... for which any minimal repetitions used ... 
> have the shortest possible match". A minimal 
> repetition nested inside a greedy one has precedence 
> (if used); otherwise, each just follows its normal rule.

However, greedy ones nested inside minimal ones are 
not discussed, and I think this should be added.

----

Observation 2:

@steffen ran experiments with PCRE and TRE, and
the results seem to conflict with Geoff's interpretation
of Danny's torture-test regular expression and
subject string.

Steffen's note:
https://www.austingroupbugs.net/view.php?id=1857#c6888

Geoff's note:
https://www.austingroupbugs.net/view.php?id=1857#c6898
