Way back in 2006, a bug was filed (RT 21765) complaining that Parse::RecDescent parsers show run times that grow exponentially with the length of the input text.

I recently posted a patch that allows a workaround to actually work, but I was curious about what was causing the bad run times on large inputs.

I used a grammar inspired by the bug report...

---START test_grammar.prd---
<skip: qr{[ \t]*} >
FILE: line(s) /\z/
line: /[a-z]+/ '=' /[0-9]+/ /\n/
---END---
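
For reference, the grammar accepts simple "name = number" lines, with optional spaces or tabs around the tokens:

    foo = 42
    barbaz=7
    quux   =   12345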

Yes, that is a "global" skip directive. Yes, I have a patch for it.

To keep the generation process out of the profiling stats, I precompiled the
grammar into a stand-alone module. I generated 10,000 lines (~135K) of test
data and fired up Devel::NYTProf.
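
The precompilation and the test-data generation aren't shown here; assuming a parser class name of TestParser (my choice of name), something along these lines does the job:

    # Precompile the grammar into a stand-alone TestParser.pm
    perl -MParse::RecDescent - test_grammar.prd TestParser

    # Generate roughly 10,000 lines of "name = number" test data
    perl -le 'print "somekey = ", int rand 100_000 for 1 .. 10_000' > test_data

Then the profiling run itself: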

perl -d:NYTProf precomp_test_run.pl test_data
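
For reference, a minimal precomp_test_run.pl, assuming the precompiled TestParser.pm from above, could be as simple as:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use TestParser;    # the precompiled grammar module

    my $file = shift or die "usage: $0 <input>\n";
    my $text = do { local $/; open my $fh, '<', $file or die $!; <$fh> };

    # Hand the whole input to the top-level FILE rule
    my $parser = TestParser->new();
    defined $parser->FILE($text) or die "parse failed\n";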

If you have never used NYTProf before, I would highly recommend it. It can
produce stats down to the opcode level, and it produces some very lovely
HTML pages with the line-by-line stats attached directly to the code.
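
The HTML report comes from the bundled nytprofhtml tool, run after the profiling pass:

    nytprofhtml    # reads ./nytprof.out, writes ./nytprof/index.html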

The test case ran in 13.2 seconds on my old, slow machine.

The stats quickly revealed that the biggest hotspot (by a large margin) was
Expectation::at(). The Expectation object is where the parser stores the
information necessary to print error messages of the form:

"Expecting 'foobar' but found 'xyz fred' instead"

Expectation::at() allows the parser to set the context - the 'xyz fred'
part. The parser does this by passing the entire input string - as it stands
at that moment - to Expectation::at(), which then makes a copy of the entire
string. It was this copying that was eating so much time - wasteful, to say
the least. So I modified Expectation::at() to copy only the first 100 bytes
of the passed text.
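
I won't reproduce the patch here, but the shape of the change is roughly this (the hash key is illustrative; the real Expectation internals may differ):

    sub at
    {
        my ($self, $text) = @_;
        # Keep only enough of the remaining input to build the
        # "...but found 'xyz fred'" part of an error message.
        $self->{location} = substr($text, 0, 100);
    }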

This dropped the running time down to 7.8 seconds or 60% of the original.
Not bad, but maybe we can do better.

The next hotspot was the several places in the generated code that look like this:
$text =~ s/\A($skip)/$lastsep=$1 and ""/e

These remove the "whitespace" (however you have defined it) from the input.
The /e modifier is very expensive, as is the capture group. I could not find
any place in the code where $lastsep was actually used for anything, so I
simplified this to:
$text =~ s/\A$skip//
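
To get a feel for the difference, here is a quick stand-alone benchmark (mine, not part of the patch) comparing the two forms on a short input:

    use strict;
    use warnings;
    use Benchmark qw(cmpthese);

    my $skip  = qr/[ \t]*/;
    my $input = "   foo = 42\n";
    my $lastsep;

    cmpthese(-2, {
        'capture + /e' => sub {
            my $text = $input;
            $text =~ s/\A($skip)/$lastsep=$1 and ""/e;
        },
        'plain'        => sub {
            my $text = $input;
            $text =~ s/\A$skip//;
        },
    });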

But there is a problem: if @itempos is being used, the captured separator is
needed to calculate the start of the next item. So I modified the code
generation to emit the capture group only when the @itempos code is also
being generated (an example of a rule that relies on @itempos is shown
below). Now my test case can be fast, but if I need @itempos in a future
grammar, it will still be correct - or at least no more wrong than before.
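
For reference, @itempos is the documented per-item position-tracking feature; a rule that uses it looks roughly like this, and it needs the captured separator text to keep its offsets correct:

    line: /[a-z]+/ '=' /[0-9]+/ /\n/
            { print "value starts at offset $itempos[3]{offset}{from}\n" }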

This dropped the running time down to 3.8 seconds, or about 30% of the
original. Not bad: throughput has increased by more than a factor of 3.

Feeling smug, and seeing no more easy pickings, I generated 50,000 lines
(~720K) of input and let 'er rip.

75.4 seconds. Drats. Still exponential.

Firing up NYTProf again, the next culprit was _parserepeat() - in particular,
the line that saves off a copy of the input text in case of failure:
my $_savetext = $text;

But that doesn't seem necessary: _parserepeat() is only ever called in order
to call a rule subroutine, and the rule subroutines should already be taking
all the necessary precautions to restore the text on failure. So away the
$_savetext lines go from _parserepeat().
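
To illustrate the reasoning with a generic sketch (this is not PRD's actual rule code): each rule works on its own copy of the text and only writes back to the caller's copy on success, so the caller doesn't need a checkpoint of its own.

    # A "rule" that only commits its consumption back to the caller's
    # $text (via the @_ alias) when it succeeds.
    sub try_line {
        my $text = $_[0];                 # private working copy
        $text =~ s/\A[a-z]+[ \t]*=[ \t]*[0-9]+\n//
            or return undef;              # failure: caller's text untouched
        $_[0] = $text;                    # success: commit the consumption
        return 1;
    }

    my $input = "foo = 1\nnot a line\n";
    try_line($input) and print "left over: '$input'\n";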

That brings the 50K case down to 50.4 seconds or about 67% of the original
run.

But now I'm stuck. Of the 50 seconds, 6 are spent checking whether the input
matches, 27 are spent on $text manipulation (saving a local copy, updating
the caller's copy to reflect the changes the rules have made), and 14.6 (!)
are spent entirely on the return statement.

I think that to truly get rid of the exponential behavior, PRD is going to
have to stop modifying the input text and instead use pos() and \G to check
for matches and to "rewind" on failure. But I'm not sure how that would work
in the face of the desire to let the grammar writer modify the text from
inside the grammar. I've been toying with a model that makes $text a tied
scalar, so that the parser can be notified when an action changes it, but
I'm not convinced yet.
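
For the curious, here is a stand-alone illustration (again, not PRD code) of the \G/pos() style: the input is never modified, a checkpoint is just an integer, and a "rewind" is just restoring it.

    use strict;
    use warnings;

    my $text = "foo = 1\nbar = 2\noops";
    pos($text) = 0;

    while (pos($text) < length $text) {
        my $mark = pos($text);            # cheap checkpoint
        if ($text =~ /\G[ \t]*([a-z]+)[ \t]*=[ \t]*([0-9]+)\n/gc) {
            print "$1 => $2\n";           # matched one line
        }
        else {
            pos($text) = $mark;           # cheap rewind on failure
            last;
        }
    }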

Any ideas? I'd love to see some discussion started...

Note: The patch can be found at http://mark.hollomon.us/PRD -- This is a
cumulative patch that contains fixes for several problems I've worked on in
the past few weeks.

Mark.
