------- You are receiving this mail because: ------- You are on the CC list for the bug.
http://bugs.exim.org/show_bug.cgi?id=1504 --- Comment #7 from Philip Hazel <[email protected]> 2014-07-21 18:44:21 --- On Mon, 21 Jul 2014, Simon McVittie wrote: > What message should I be passing on to the GLib maintainers? We've ruled out > "the new PCRE is wrong", leaving the possibilities as either "the new PCRE is > intentionally not fully compatible with the old" or "GLib's regression tests > should not have been asserting that that precise behaviour is present". Oh dear. You seem to have discovered a can of worms here, caused by oversights on my (and others) part as the PCRE code has evolved. The documentation is not as helpful as it might be. In the pcreapi page it says: A second matching function, pcre_dfa_exec(), which is not Perl-compatible, is also provided. This uses a different algorithm for the matching. The alternative algorithm finds all possible matches (at a given point in the subject), and scans the subject just once (unless there are lookbehind assertions). However, this algorithm does not return captured substrings. But in the pcrematching page, it modifies the statement about finding all possible matches: PCRE's "auto-possessification" optimization usually applies to character repeats at the end of a pattern (as well as internally). For example, the pattern "a\d+" is compiled as if it were "a\d++" because there is no point even considering the possibility of backtracking into the repeated digits. For DFA matching, this means that only one possible match is found. If you really do want multiple matches in such cases, either use an ungreedy repeat ("a\d+?") or set the PCRE_NO_AUTO_POSSESS option when compiling. I am at present in the middle of developing an entirely new API for PCRE (called PCRE2, and discussed on the list some months ago). Once this is done (most of the code is done and I'm working on converting the tests), there will be a complete revision of the documentation, and I will try to improve the DFA documentation to make it all clearer. I think the bottom line is "please use PCRE2_NO_AUTO_POSSESS if you want to get all possible matches from DFA matching" but it needs more explanation and examples. > If the new PCRE is intentionally not fully compatible with the old, perhaps we > should be looking into a SONAME bump and a coordinated transition... The changes were intentional (and I'm sure I bumped something, but perhaps not enough) but we obviously didn't recognize the extent of the incompatibility. As PCRE tries to track Perl, there may well be other things like (?P<1>) in future ... in fact I have just discovered today that Perl's treatment of \8 and \9 has changed in the absence of groups numbered 8 or 9, and its treatment of \c when not followed by a printable ASCII character is also different (it now gives an error). > If GLib's regression tests are just being too picky, and should not have been > making those assertions, then that's also useful information, and perhaps > points to GLib's documentation being too specific about the expected results. > What result should be expected from matching a+ against aaa in this mode? I don't think the tests are too picky. This has flagged up something that can be improved. I think the DFA matching process should be much more clearly laid out in the PCRE documentation; the extension of auto-possessification has changed the situation. It should say something like this for DFA matching /a+/ (i.e. a complete pattern, not as part of something else): . Without PCRE_NO_AUTO_POSSESS the result is the single string "aaa". . With PCRE_NO_AUTO_POSSESS the result is three matches, "aaa", "aa", and "a". For normal (non-DFA) matching, the result is of course just "aaa". > The GLib documentation currently says > > """ > Using the standard algorithm for regular expression matching only the longest > match in the string is retrieved, it is not possible to obtain all the > available matches. For instance matching "<a> <b> <c>" against the pattern > "<.*>" you get "<a> <b> <c>". > > This function uses a different algorithm (called DFA, i.e. deterministic > finite > automaton), so it can retrieve all the possible matches, all starting at the > same point in the string. For instance matching "<a> <b> <c>" against the > pattern "<.*>;" you would obtain three matches: "<a> <b> <c>", "<a> <b>" and > "<a>". > """ That is correct. When there are explicit possessive quantifiers in the pattern, the number of available matches may be smaller, but the creator of the pattern should realize this. The problem situation is when PCRE does some auto-possessification behind the user's back - this won't always cause a problem, but it can, as we have seen, especially if more cases are added to the auto-possessification code (which is what has just happened). If the intention of the GLib function is always to provide all possible matches, then I would always recommend using PCRE_NO_AUTO_POSSESS. Philip -- Configure bugmail: http://bugs.exim.org/userprefs.cgi?tab=email -- ## List details at https://lists.exim.org/mailman/listinfo/pcre-dev
