I have just committed a patch that makes some small changes to the way 
partial matches are handled in the interpreter. I hope Zoltán will in 
due course pick these up for the JIT. (There are new tests at the end of
testinput2 which have no_jit set at the moment.)

The changes are really quite small, but I think they address some of the 
issues that have been discussed at length here. I am sorry that it has 
taken so long for me to see the essential points, but I think I do now 
have the general principles clear. The result is, in fact, two very 
minor changes:

(1) The "must have inspected at least one character" condition for 
recognizing a partial match is now extended with "OR the pattern 
must contain a lookbehind of non-zero length". This applies to both hard
and soft partial matches. These two conditions ensure that a partial
match is recognized when there is a possibility that adding more
characters may enable a complete match to be found.

Interestingly, I discovered that I had documented this situation already
when (in pcre2partial) I wrote:

  For this reason, a "no match" result should be interpreted as "partial
  match of an empty string" when the pattern contains lookbehinds.

This sentence has now been removed from pcre2partial because an empty 
partial match is now given.
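
As a rough illustration, the extended condition in (1) can be modelled as
a single predicate. This is only a sketch and not the interpreter's code:
the function and parameter names are hypothetical, and it assumes that the
end of the subject has already been reached during matching.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the condition in (1). Not taken from the PCRE2
   source; both parameter names are assumptions made for illustration.
   The caller is assumed to have already hit the end of the subject. */
bool partial_match_possible(size_t chars_inspected, size_t max_lookbehind)
{
    /* Old rule: at least one character must have been inspected.
       New rule: OR the pattern contains a lookbehind of non-zero length. */
    return chars_inspected > 0 || max_lookbehind > 0;
}

/* Small self-check of the three interesting cases. */
void check_partial_condition(void)
{
    assert(!partial_match_possible(0, 0)); /* nothing inspected, no lookbehind */
    assert(partial_match_possible(0, 1));  /* lookbehind alone is now enough   */
    assert(partial_match_possible(2, 0));  /* unchanged: characters inspected  */
}
```

The middle case is the one that used to give "no match" and now gives an
empty partial match.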

(2) It was already documented that \z and \Z should not match at the end 
of a subject if PCRE2_PARTIAL_HARD is set. This was not working when no 
characters had been inspected (and, after (1) was implemented, still not 
working for non-lookbehind patterns). I have made patterns such as /\z/ 
give appropriate partial matches.


Further points:

On Fri, 19 Jul 2019, ND via Pcre-dev wrote:

> Alternative suggestion may be:

<snip>

> Disadvantages:
> 1. It may be a breaking change.

Indeed, and that is one reason I have not done it. Also, the changes I 
*have* made are very small, which I like. :-)


Finally:

There is still the problem of patterns for which a return value "no 
match in this segment and it will never match however many more 
characters are added" would be useful. ND quoted /(*COMMIT)(*F)/ as a
simple example. Is (*COMMIT) the only way this might happen?

There is an item on the Wish List requesting a way of determining
whether a match failed in a start-up optimization or while running a
matching engine. I haven't done anything about it because it would
require JIT work.

What could be done is to add a new field to the match data that records 
why a match failed. A new function (e.g. pcre2_get_fail_reason) could 
return this to the user. Possible returns could be:

  PCRE2_FAILEDBY_START_OPTIMIZATION
  PCRE2_FAILEDBY_INTERPRETER
  PCRE2_FAILEDBY_INTERPRETER_COMMIT  

  PCRE2_FAILEDBY_JIT_START_OPTIMIZATION
  PCRE2_FAILEDBY_JIT
  PCRE2_FAILEDBY_JIT_COMMIT  

  PCRE2_FAILEDBY_DFA_START_OPTIMIZATION
  PCRE2_FAILEDBY_DFA_INTERPRETER

Could this be useful?

Philip

-- 
Philip Hazel
-- 
## List details at https://lists.exim.org/mailman/listinfo/pcre-dev 