Quick status update.  I tried to prototype this in pure PHP in the
wikimedia/remex-html library by wrapping each regexp in (?= ... ) and
replacing each capture group (...) with ()...(), so that every capture is
a zero-length string and the substring copy is effectively bypassed.  That
didn't speed up remex-html on my pet benchmark, for two reasons: (1) the
(?= ... ) appears to deoptimize the regexp match, and (2) it turns out
there's a substantial cost to each capture (presumably all those
two-element arrays which Nikita flagged before as a future issue), so
doubling the total number of captures by using ()...() instead of (...)
slowed the match down.
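
For concreteness, here is a minimal sketch of the rewrite described above.
The pattern and subject are made up for illustration and are not taken from
remex-html; the point is only to show how the empty groups around each
original capture give start and end offsets, so lengths can be computed
without copying substrings.

```php
<?php
// Hypothetical subject and patterns, for illustration only.
$subject = 'xx name=12345 yy';

// Ordinary form: each capture group copies its matched substring.
$plain = '/([a-z]+)=([0-9]+)/';

// Emulated form: the whole pattern sits inside a lookahead, and each
// original group (X) is replaced by ()X(), so every capture is empty.
$emulated = '/(?=()[a-z]+()=()[0-9]+())/';

preg_match($plain, $subject, $p, PREG_OFFSET_CAPTURE);
preg_match($emulated, $subject, $e, PREG_OFFSET_CAPTURE);

// Each original group N now maps to empty groups 2N-1 (start) and 2N (end).
$start1 = $e[1][1];
$len1   = $e[2][1] - $e[1][1];
$start2 = $e[3][1];
$len2   = $e[4][1] - $e[3][1];
```

Note that two original groups became four, which is exactly the doubling of
captures (and of two-element arrays) that made this slower in practice.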

So bad news: my benchmarking shortcut didn't work. Potential good news: I
guess that underlines why this feature is necessary and can't just be
emulated.

I'm going to try this benchmark again tomorrow, this time rebuilding PHP
from source with Nikita's proposed patch so that I can actually get an
apples-to-apples comparison.
   --scott

On Thu, Mar 21, 2019 at 7:35 AM Nikita Popov <nikita....@gmail.com> wrote:

> On Wed, Mar 20, 2019 at 4:35 PM C. Scott Ananian <canan...@wikimedia.org>
> wrote:
>
>> On Tue, Mar 19, 2019 at 10:58 AM Nikita Popov <nikita....@gmail.com>
>> wrote:
>>
>>> After thinking about this some more, while this may be a minor
>>> performance improvement, it still does more work than necessary. In
>>> particular the use of OFFSET_CAPTURE (which would be pretty much required
>>> here) needs one new two-element array for each subpattern. If the captured
>>> strings are short, this is where the main cost is going to be.
>>>
>>
>> The primary use of this feature is when the captured strings are *long*,
>> as that's when we most want to avoid copying a substring.
>>
>>
>>> I'm wondering if we shouldn't consider a new object oriented API for
>>> PCRE which can return a match object where subpattern positions and
>>> contents can be queried via method calls, so you only pay for the parts
>>> that you do access.
>>>
>>
>> Seems like this is letting the perfect be the enemy of the good.  The
>> LENGTH_CAPTURE significantly reduces allocation for long match strings, and
>> it allocates the same two-element arrays that OFFSET_CAPTURE would -- it
>> just stores an integer where there would otherwise be an expensive
>> substring.  Furthermore, since the array structure is left mostly alone, it
>> would be not-too-hard to support earlier-PHP versions, with something like:
>>
>> $hasLengthCapture = defined('PREG_LENGTH_CAPTURE') ? PREG_LENGTH_CAPTURE : 0;
>> $r = preg_match($pat, $sub, $m, PREG_OFFSET_CAPTURE | $hasLengthCapture);
>> $matchOneLength = $hasLengthCapture ? $m[1][0] : strlen($m[1][0]);
>> $matchOneOffset = $m[1][1];
>>
>> If you introduce a whole new OO accessor object, it starts becoming very
>> hard to write backward-compatible code.
>>  --scott
>>
>
> Fair enough. I've created https://github.com/php/php-src/pull/3971 to
> implement this feature. It would be good to have some confirmation that
> this is really a significant performance improvement before we land it
> though.
>
> Nikita
>


-- 
(http://cscott.net)
