On 10.12.2014 12:30, [email protected] wrote:

In various parts, PCRE2 runs the following to check if a substring number if
valid:

   if (stringnumber >= match_data->oveccount ||
       stringnumber > match_data->code->top_bracket ||
       match_data->ovector[stringnumber*2] == PCRE2_UNSET)

I wonder if it is indeed necessary to compare with top_bracket. Rationale is
this comment in pcre2_match.c:

   If there is space in the offset vector, set any unused pairs at the
   end to PCRE2_UNSET for backwards compatibility.

Provided that the above holds true, should it not be sufficient to test for
PCRE2_UNSET?

I don't think so, because by "unused pairs" it actually means "unused
pairs up to the maximum capturing groups in the pattern". Consider the
case when oveccount is large (say 100), and pattern contains, say, 5
capturing parentheses. Suppose a match operation passes through only
groups 1,2,3. In that case, the code that follows that comment will set
the vector for groups 4 and 5 to PCRE2_UNSET. However, what if
stringnumber is set to 20?

Obviously, I should make that comment in pcre2_match.c a bit clearer.

My assumption was that stringnumber 6 to 20 would also be set to PCRE2_UNSET. It was not clear to me from reading the comment that they are left untouched.

If the behavior is like you described, the additional test for top_bracket is of course necessary.

In addition, the code snippet makes pcre2_match_data depend on pcre2_code. If
pcre2_code is freed before pcre2_match_data, the outcome of the code snipped
is undetermined. I have searched the documentation, but have not found it
mentioning this.

That is a good point. The documentation should tell you not to free the
code while still making use of the match data.

+1

A related thought:

The above code extract reappears identically multiple times in
pcre2_substring.c. Would it make sense to refactor substring validity checking
into its own, dedicated function?

I would be more tempted to make it into a macro.

+1

IMO, this would also be a welcome addition to the public API.

You mean something like:

int pcre2_substring_isset(match_data, stringnumber)

Exactly.

? Is this anything more than just a cosmetic version of
pcre2_substring_length_by{number,name}() ? For those two functions, we
could add the specification that sizeptr could be NULL, in which case
they are exactly "test for this group being set".

The dedicated and properly named function would probably be easier to find by developers. And it would perform slightly faster. Maybe insignificantly so, but many small improvements add up.

Other than that, your suggestion sounds fine!

Ralf

--
## List details at https://lists.exim.org/mailman/listinfo/pcre-dev

Reply via email to