[pcre-dev] [Bug 1504] DFA matching seems to have regressed, causing GLib test failure

Philip Hazel Mon, 21 Jul 2014 10:45:13 -0700

http://bugs.exim.org/show_bug.cgi?id=1504

--- Comment #7 from Philip Hazel <[email protected]>  2014-07-21 18:44:21 ---
On Mon, 21 Jul 2014, Simon McVittie wrote:

> What message should I be passing on to the GLib maintainers? We've ruled out
> "the new PCRE is wrong", leaving the possibilities as either "the new PCRE is
> intentionally not fully compatible with the old" or "GLib's regression tests
> should not have been asserting that that precise behaviour is present".

Oh dear. You seem to have discovered a can of worms here, caused by 
oversights on my (and others') part as the PCRE code has evolved. The
documentation is not as helpful as it might be. In the pcreapi page it
says:

  A second matching function, pcre_dfa_exec(), which is not
  Perl-compatible, is also provided. This uses a different algorithm for
  the matching. The alternative algorithm finds all possible matches (at
  a given point in the subject), and scans the subject just once (unless
  there are lookbehind assertions). However, this algorithm does not
  return captured substrings.

But the pcrematching page qualifies the statement about finding 
all possible matches:

  PCRE's "auto-possessification" optimization usually applies to
  character repeats at the end of a pattern (as well as internally). For
  example, the pattern "a\d+" is compiled as if it were "a\d++" because
  there is no point even considering the possibility of backtracking
  into the repeated digits. For DFA matching, this means that only one
  possible match is found. If you really do want multiple matches in
  such cases, either use an ungreedy repeat ("a\d+?") or set the
  PCRE_NO_AUTO_POSSESS option when compiling.

I am at present in the middle of developing an entirely new API for PCRE 
(called PCRE2, and discussed on the list some months ago). Once this is 
done (most of the code is done and I'm working on converting the tests), 
there will be a complete revision of the documentation, and I will try 
to improve the DFA documentation to make it all clearer. I think the 
bottom line is "please use PCRE2_NO_AUTO_POSSESS if you want to get all 
possible matches from DFA matching" but it needs more explanation and 
examples.

> If the new PCRE is intentionally not fully compatible with the old, perhaps we
> should be looking into a SONAME bump and a coordinated transition...

The changes were intentional (and I'm sure I bumped something, but 
perhaps not enough) but we obviously didn't recognize the extent of the 
incompatibility. As PCRE tries to track Perl, there may well be other 
things like (?P<1>) in future ... in fact I have just discovered today 
that Perl's treatment of \8 and \9 has changed in the absence of groups
numbered 8 or 9, and its treatment of \c when not followed by a 
printable ASCII character is also different (it now gives an error).

> If GLib's regression tests are just being too picky, and should not have been
> making those assertions, then that's also useful information, and perhaps
> points to GLib's documentation being too specific about the expected results.
> What result should be expected from matching a+ against aaa in this mode?

I don't think the tests are too picky. This has flagged up something 
that can be improved.

I think the DFA matching process should be much more clearly laid out in
the PCRE documentation; the extension of auto-possessification has
changed the situation. It should say something like this for DFA
matching /a+/ (i.e. a complete pattern, not as part of something else):

. Without PCRE_NO_AUTO_POSSESS the result is the single string "aaa".
. With PCRE_NO_AUTO_POSSESS the result is three matches, "aaa", "aa", and "a".

For normal (non-DFA) matching, the result is of course just "aaa".
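
To make the three cases concrete, here is a rough sketch in Python. It does
not call PCRE at all: it uses Python's own re module and brute-force
enumeration as a stand-in, and the function names (all_matches_at, model_dfa)
are invented for illustration. The auto_possess flag only models the effect
described above (only the longest match survives); it is an assumption of the
sketch, not how PCRE implements the optimization.

```python
import re


def all_matches_at(pattern, subject, start=0):
    """Every substring starting at `start` that the whole pattern matches,
    longest first. Brute force over end offsets, not a real DFA."""
    compiled = re.compile(pattern)
    return [
        subject[start:end]
        for end in range(len(subject), start - 1, -1)
        if compiled.fullmatch(subject, start, end)
    ]


def model_dfa(pattern, subject, auto_possess):
    """Toy model of the DFA results for a pattern ending in a repeat."""
    matches = all_matches_at(pattern, subject)
    # Modelling assumption: a possessified final repeat never gives
    # characters back, so only the longest match is reported.
    return matches[:1] if auto_possess else matches


if __name__ == "__main__":
    print(model_dfa("a+", "aaa", auto_possess=True))   # auto-possessified
    print(model_dfa("a+", "aaa", auto_possess=False))  # NO_AUTO_POSSESS
    print(re.match("a+", "aaa").group())               # normal matching
```

The first call stands for the default behaviour, the second for compiling
with PCRE_NO_AUTO_POSSESS, and the last for ordinary backtracking matching.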

> The GLib documentation currently says
> 
> """
> Using the standard algorithm for regular expression matching only the longest
> match in the string is retrieved, it is not possible to obtain all the
> available matches. For instance matching "<a> <b> <c>" against the pattern
> "<.*>" you get "<a> <b> <c>".
> 
> This function uses a different algorithm (called DFA, i.e. deterministic finite
> automaton), so it can retrieve all the possible matches, all starting at the
> same point in the string. For instance matching "<a> <b> <c>" against the
> pattern "<.*>" you would obtain three matches: "<a> <b> <c>", "<a> <b>" and
> "<a>".
> """

That is correct. When there are explicit possessive quantifiers in the 
pattern, the number of available matches may be smaller, but the creator 
of the pattern should realize this. The problem situation is when PCRE 
does some auto-possessification behind the user's back - this won't 
always cause a problem, but it can, as we have seen, especially if more 
cases are added to the auto-possessification code (which is what has 
just happened).
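
The example from the GLib documentation quoted above can be checked with a
small sketch along the same lines. Again this is Python's re module used as a
brute-force stand-in, not PCRE; match_offsets is a hypothetical helper that
returns (start, end) pairs, longest first, loosely mirroring the order in
which pcre_dfa_exec() fills its output vector.

```python
import re


def match_offsets(pattern, subject, start=0):
    """(start, end) offsets of every match of the whole pattern that begins
    at `start`, longest first. Brute force, for illustration only."""
    compiled = re.compile(pattern)
    return [
        (start, end)
        for end in range(len(subject), start - 1, -1)
        if compiled.fullmatch(subject, start, end)
    ]


if __name__ == "__main__":
    subject = "<a> <b> <c>"
    for begin, end in match_offsets("<.*>", subject):
        print(begin, end, subject[begin:end])
```

If auto-possessification (or an explicit possessive quantifier) removed the
shorter alternatives, a caller would see only the first pair in this list.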

If the intention of the GLib function is always to provide all possible 
matches, then I would always recommend using PCRE_NO_AUTO_POSSESS.

Philip
