On Thu, 15 Oct 2009, Eldar Kleiner wrote:

> We were wondering what do you think about adding the following pcre behviour
> when multi segment matching is used with pcre_dfa_exec():
> 
> At the moment, patterns that contain alternatives at the top level are
> somehow problematic as only alternatives that match at lowest point in the
> subject are remembered. As a result matches might be missed for alternatives
> which share the same characters.
> 
> We thought it might be useful to be able to define the number ?saved? points
> or at least return an indication that matches might be missed. (and then let
> the caller of pcre_dfa_exec) decides if linearization should be used to find
> possible lost matches.

The Bad News:

The problem here is that pcre_dfa_exec() looks no further along the 
subject once it has found a partial match. Even if it did, there is 
nowhere for it to remember anything. I became even more aware of this
recently when doing some work on partial matching for pcre_exec() (*not*
pcre_dfa_exec()).

The Slightly Better News:

I am hoping to release 8.00 early next week. This contains improvements 
to partial matching with pcre_exec() (removing all restrictions and 
giving feedback about the partially matched string). It should now be
possible to do a kind of multi-segment matching using pcre_exec()
provided that the matched string is short enough to fit in your buffer.

For example, roughly speaking, if your buffer contains "xxxxxxxmmm"
where "mmm" is the partial match at the end of the buffer, you can throw
away "xxxxxxxx", move "mmm" to the start of the buffer, refill the 
buffer, and then try the whole match again with "mmmyyyyyy". If it 
doesn't match at the start, it will move on as normal, so you would 
find, for example "myy" if that was a match.

This technique can, of course, also be used with pcre_dfa_exec(). I said 
"roughly speaking" above because there is an issue with lookbehinds. The 
code for both functions now passes back the offset of the earliest 
character inspected during the partial match (which might be earlier 
than the actual match point), but it is theoretically possible for a
lookbehind later in the pattern to look back even further. Some 
protection against this happening is to leave extra characters at the 
start of the buffer. It doesn't seem to me to be very likely that people 
would actually use patterns that match a few characters and then 
lookbehind beyond what has already been matched, but I have learned 
never to underestimate what actually gets done. :-)

In the documentation for the 8.00 release I have written this:

  4. Patterns that contain alternatives at the top level which do not
  all start with the same pattern item may not work as expected when
  pcre_dfa_exec() is used. For example, consider this pattern:
  
    1234|3789
  
  If the first part of the subject is "ABC123", a partial match of the
  first alternative is found at offset 3. There is no partial match for
  the second alternative, because such a match does not start at the
  same point in the subject string. Attempting to continue with the
  string "7890" does not yield a match because only those alternatives
  that match at one point in the subject are remembered. The problem
  arises because the start of the second alternative matches within the
  first alternative. There is no problem with anchored patterns or
  patterns such as:
  
    1234|ABCD
  
  where no string can be a partial match for both alternatives. This is
  not a problem if pcre_exec() is used, because the entire match has to
  be rerun each time.

I will add a sentence pointing out that this approach can also be used 
with pcre_dfa_exec().

If you want to try the 8.00 release candidate, you will find it here:

ftp://ftp.csx.cam.ac.uk/pub/software/programming/pcre/Testing/pcre-8.00-RC1.tar.gz
ftp://ftp.csx.cam.ac.uk/pub/software/programming/pcre/Testing/pcre-8.00-RC1.tar.bz2
ftp://ftp.csx.cam.ac.uk/pub/software/programming/pcre/Testing/pcre-8.00-RC1.zip

Philip

-- 
Philip Hazel

-- 
## List details at http://lists.exim.org/mailman/listinfo/pcre-dev 

Reply via email to