On Tue, 24 Jan 2012, ND wrote:

> Maybe it would be useful to combine our approaches to reduce the number of
> cases in which a zero-length partial match may occur?
> So, a zero-length partial match is allowed when:
>  1. it arises within a lookahead
> AND
>  2. the pattern has a lookbehind with non-zero length

I knew I had to think about this more. My idea was wrong. You don't need 
lookahead assertions. Consider /b(?<=..)/ and "abc". There are probably 
many different examples one can construct. The key thing, of course, is 
the lookbehind.
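
The /b(?<=..)/ example can be tried out with a small sketch. It uses Python's re module as a stand-in for PCRE, which is an assumption made purely for illustration: re has no partial matching, but its lookbehind handles this pattern the same way. The split of "abc" into the segments "a" and "bc" is also assumed here, since the text above does not say where the segment boundary falls.

```python
import re

# Python's re stands in for PCRE here (an assumption for illustration only).
pattern = re.compile(r"b(?<=..)")

# Whole subject available: after "b" is consumed, the lookbehind sees "ab".
whole = pattern.search("abc")

# Hypothetical second segment matched on its own, with "a" already discarded:
# only one character precedes the lookbehind point, so it fails.
second_only = pattern.search("bc")

# Same start position as above, but "a" is retained in the buffer and the
# search begins at offset 1 (the analogue of PCRE's start_offset).
retained = pattern.search("abc", 1)

print(whole is not None, second_only is None, retained is not None)
```

The third call is the interesting one: the search begins at the new segment, yet the lookbehind can still inspect the retained character.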

The documentation for partial matching says this:

  2. Lookbehind assertions at the start of a pattern are catered for in
  the offsets that are returned for a partial match. However, in theory,
  a lookbehind assertion later in the pattern could require even earlier
  characters to be inspected, and it might not have been reached when a
  partial match occurs. This is probably an extremely unlikely case; you
  could guard against it to a certain extent by always including extra
  characters at the start.
  
Obviously, this situation isn't as "extremely unlikely" as I thought... 

I have had the following ideas:

An unanchored pattern could always give a zero-length partial match at 
the end of the string. I chose not to do this, because at the time I did 
not think it was ever useful. It turns out that it is useful, but only 
if there is a lookbehind later in the pattern.

However, this means that "no match" actually means "zero-length partial 
match at end of string". What does an application that is doing 
multi-segment matching do with these results?

No match:       get next segment and start matching at the start.
Partial match:  get next segment and start matching at the start.

In other words, *the same thing*. However, what matters is whether any 
of the previous segment is retained, in case there is a lookbehind.

PCRE could tell the application the maximum lookbehind length, but what 
it cannot tell is whether there is a lookbehind further along the path 
that is being matched. So I don't think it should change its result.
However, an application can choose to treat "no match" as "partial
match", and retain some characters from the previous segment. So I think 
the code could be something like this:

  IF hard partial matching AND not anchored[1] AND no match THEN
    Retain maximum lookbehind length[2] in current segment
    Join next segment
    Match again, with start_offset set to point to next segment
    
[1] PCRE_INFO_OPTIONS can be used to find if anchored.
[2] PCRE_INFO_MAX_LOOKBEHIND does not exist, but it could quite easily
    be implemented. Until it is, you could just guess a suitable number.   
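
The steps above might be sketched roughly as follows. This is not PCRE code: it uses Python's re module as a stand-in, and the function name search_segments and the max_lookbehind parameter are hypothetical, with max_lookbehind playing the role of the guessed number from note [2]. It assumes an unanchored pattern and only covers the "no match" case. Because re has no partial matching, a match that spans two segments is not handled.

```python
import re

def search_segments(pattern, segments, max_lookbehind):
    """Search segments in order, retaining a tail of earlier text.

    max_lookbehind is a caller-supplied guess at the longest lookbehind
    in the pattern. Returns (segment_index, matched_text) or None.
    """
    regex = re.compile(pattern)
    retained = ""
    for index, segment in enumerate(segments):
        buffer = retained + segment      # join the next segment
        start_offset = len(retained)     # start matching at the new segment
        match = regex.search(buffer, start_offset)
        if match:
            return index, match.group()
        # Treat "no match" like a partial match: keep the last few
        # characters so that a later lookbehind can still see them.
        # (The guard avoids buffer[-0:], which would keep everything.)
        retained = buffer[-max_lookbehind:] if max_lookbehind > 0 else ""
    return None

with_retention = search_segments(r"b(?<=..)", ["a", "bc"], 2)
without_retention = search_segments(r"b(?<=..)", ["a", "bc"], 0)
print(with_retention, without_retention)
```

With nothing retained, the match in the second segment is missed. With two characters retained, the lookbehind can see the "a" from the first segment.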

Philip

-- 
Philip Hazel
