Thanks for the feedback, Aaron! I'll incorporate your suggestions to
improve the paper.

Yes, it's hard to beat recursive descent parsing, not just because the
algorithm scales linearly with very low overhead, but also because of
its simplicity. Many people assume left recursion is a solved problem
for recursive descent parsing, since even Ford's original thesis gave
rewriting heuristics for turning left-recursive rules into
non-left-recursive rules. However, this rewriting can become extremely
complex when the grammar contains multiple nested cycles of indirect
left recursion, and the resulting AST can deviate quite far from the
intended structure.
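
To give the standard textbook illustration of the AST-shape problem
(this example is mine, not one from the paper or the thesis): the
directly left-recursive rule

    E <- E '+' T / T

parses "a+b+c" into the left-associative tree ((a+b)+c), while the
usual non-left-recursive rewrite

    E <- T ('+' T)*

matches exactly the same strings but yields a flat sequence of T's, so
a post-processing pass is needed to rebuild the intended
left-associative AST. Once several rules refer to each other through
nested indirect cycles, both the rewrite itself and that AST
reconstruction step become far more intricate.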

As far as memory usage goes -- the difference in runtime between a
fully memoized recursive descent parser and the bottom-up DP (pika)
parser is directly and linearly correlated with the difference in
memory consumption between the two algorithms. (The paper points this
out, but the point is easy to miss.)
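
As a rough back-of-the-envelope sketch (schematic only, and assuming
one memo table entry per (clause, start position) pair with amortized
constant work per entry -- an assumption of mine, not a figure from the
paper):

    time   ~ c_t * (number of memo entries written)
    memory ~ c_m * (number of memo entries written)

A top-down packrat parser only writes the entries it actually reaches,
while the bottom-up pika parser can populate a much larger fraction of
the table; either way, runtime and memory differ only by a constant
factor, so the runtime gap between the two algorithms mirrors the
memory gap.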

You said: "Given that recursion in the grammar may result in the same
clause matching multiple times at the same position (as long as the new
match is longer), I think the worst case-bound on h is O(n), not O(1),
which would make pika parsing’s worst-case time bound O(n^2)." Actually
this is not the case, otherwise recursive descent parsers would also be
O(n^2) if their grammar contained cycles. The reason is subtle: The size of
the parse tree produced by any parser is linearly correlated with the
length of the input, *regardless of the depth of the parse tree or the
number of loops around a cycle in the grammar*, because at the limit, there
would be one node in the parse tree per character, even if the parse tree
degenerated into a singly-linked list. This is why in the paper, I always
refer to recursive descent parsing as taking time "linear in the length of
the input and the depth of the parse tree", because actually those two
things represent the same thing. (i.e. you don't multiply the length of the
input by the depth of the parse tree to obtain the final time complexity,
you equate them to each other).
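
To make the degenerate case concrete, here is a minimal (hypothetical)
grammar of my own, not one from the paper:

    E <- 'x' E / 'x'

On the input "xxxxx" (n = 5), the parse tree is the chain

    E('x', E('x', E('x', E('x', E('x')))))

whose depth is n and whose node count is also n: one E node per input
character. The tree cannot get any deeper without also getting
proportionally larger, which is why "depth of the parse tree" and
"length of the input" collapse into the same linear bound.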

It may be worth showing an intermediate example between the two
extremes -- good suggestion. (However, the paper is already pushing the
limits of how long a paper can realistically be!)

I should also point out that I noticed the other thread on this list
about Mouse by Roman Redziejowski ('[PEG] Left recursion in "Mouse"').
Mouse solves left recursion the same way the pika parser does, by
parsing bottom-up. I believe this hybrid approach may actually be a
better solution in general than the pika parser, because of its higher
efficiency; even before I posted to this PEG list, I had already
started thinking about doing something very similar. However, at least
when I first looked at Mouse, I didn't understand all the implications
of the Mouse heuristics for bottom-up parsing of left-recursive rules
(in particular, how they would work with multiple nested loops around
cycles in the grammar), and I had previously arrived at some very
different heuristics for solving this problem, involving pushing onto a
stack of nested cycles. I need to implement my alternative to Mouse to
see whether those heuristics do in fact work, then compare against
Mouse.

All the best,
Luke



On Thu, Jul 9, 2020 at 12:17 AM Moss, Aaron <mo...@up.edu> wrote:

> Fascinating work; I'm always interested to see a new PEG parsing
> algorithm, and handling left recursion as cleanly as you did is impressive.
>
>
>
> I’m not sure if it would make a difference, but it might be interesting to
> use just the seed parent relationship rather than the subclause
> relationship in your pseudo-topological sort of clauses – any clause other
> than the first one will be in a later column, and thus available in the
> bottom-up dynamic programming order you defined. (for what it’s worth, I
> found the textual description of the modifications to topological sort to
> handle cycles somewhat confusing (though clarified by the code listing),
> but I don’t have any concrete suggestions for improvement).
>
>
>
> I’d also be interested to see performance results for a grammar of
> intermediate complexity between your expression grammar and Java. I have a
> JSON grammar in the repository for my own experimental parsing
> algorithm[1], along with some test input (hat tip to Kota Mizushima for
> providing both). One of the tricky things about new PEG parsing algorithms
> is that recursive descent is actually a linear-time algorithm for many
> practical inputs [2, s.5.1], so it’s hard to beat it. I’d also be curious
> to see a comparison of memory usage between your algorithm and ANTLR4 and
> Parboiled – recursive descent is often faster in practice than straight
> packrat, and a lot of the reason is packrat's much higher memory use.
>
>
>
> [1] https://github.com/bruceiv/egg/tree/deriv/perf_test
>
> [2] http://eptcs.web.cse.unsw.edu.au/paper.cgi?AFL2017.18.pdf
>
>
>
> One concern I have is that your claimed O(n) runtime depends on overhead
> of spurious matches being constant per character, and doesn’t account for
> the left-recursion features you’ve added to the grammar. (I didn’t do a
> re-read of your whole paper after noticing this, but it’s possible you
> avoided claiming it’s a worst-case bound, though your language does suggest
> that interpretation.) Given that recursion in the grammar may result in
> the same clause matching multiple times at the same position (as long as
> the new match is longer), I think the worst-case bound on h is O(n), not
> O(1), which would make pika parsing's worst-case time bound O(n^2). That
> said, trimming a linear factor off the O(n^3) bound for the general-purpose
> CFG parsing algorithms is still very impressive, and (given that grammar
> recursion is generally bounded by a small constant in practice), I expect
> pika parsing to be linear-time in practice. [I found that your discussion
> conflated the issues of left recursion and grammar looping in a somewhat
> confusing fashion, since whether or not the recursion occurs at the same
> input position is an important distinction. My comments on your
> performance bounds may only apply to left-recursive grammars.]
>
>
>
> Also, there’s a small typo on p.18: "parsing speed for the expression
> grammar ... is approximately 8.9 times higher than the parsing speed for
> the [Java] grammar"
>
>
>
> Again, neat work, and a fascinating paper.
>
>
>
> Best,
>
> Aaron Moss
>
>
>
> *From:* PEG [mailto:peg-boun...@lists.csail.mit.edu] *On Behalf Of* Luke
> Hutchison
> *Sent:* Tuesday, July 7, 2020 23:22
> *To:* peg@lists.csail.mit.edu
> *Subject:* [PEG] Paper on bottom-up, right-to-left PEG parsing
>
>
>
>
> I published a preprint of the following paper on arXiv.
>
>
>
> After writing the paper, I realized that Bryan Ford briefly described the
> idea behind this algorithm in his 2002 Master's thesis; however, as far as
> I can tell, he did not pursue the idea to implementation, preferring the
> higher efficiency of packrat parsing.
>
>
>
> I submitted this to ACM TOPLAS for review. Feedback welcome.
>
>
>
> Luke Hutchison
>
>
>
> ---
>
>
>
> *Pika parsing: reformulating packrat parsing as a dynamic programming
> algorithm solves the left recursion and error recovery problems*
>
> *Abstract:* A recursive descent parser is built from a set of
> mutually-recursive functions, where each function directly implements one
> of the nonterminals of a grammar. A packrat parser uses memoization to
> reduce the time complexity for recursive descent parsing from exponential
> to linear in the length of the input. Recursive descent parsers are
> extremely simple to write, but suffer from two significant problems: (i)
> left-recursive grammars cause the parser to get stuck in infinite
> recursion, and (ii) it can be difficult or impossible to optimally recover
> the parse state and continue parsing after a syntax error. Both problems
> are solved by the pika parser, a novel reformulation of packrat parsing as
> a dynamic programming algorithm, which requires parsing the input in
> reverse: bottom-up and right to left, rather than top-down and left to
> right. This reversed parsing order enables pika parsers to handle grammars
> that use either direct or indirect left recursion to achieve left
> associativity, simplifying grammar writing, and also enables optimal
> recovery from syntax errors, which is a crucial property for IDEs and
> compilers. Pika parsing maintains the linear-time performance
> characteristics of packrat parsing as a function of input length. The pika
> parser was benchmarked against the widely-used Parboiled2 and ANTLR4
> parsing libraries. The pika parser performed significantly better than the
> other parsers for an expression grammar, although for a complex grammar
> implementing the Java language specification, a large constant performance
> impact was incurred per input character. Therefore, if performance is
> important, pika parsing is best applied to simple to moderate-sized
> grammars, or to very large inputs, if other parsing alternatives do not
> scale linearly in the length of the input. Several new insights into
> precedence, associativity, and left recursion are presented.
>
> *Preprint:* https://arxiv.org/abs/2005.06444
>
>
>
_______________________________________________
PEG mailing list
PEG@lists.csail.mit.edu
https://lists.csail.mit.edu/mailman/listinfo/peg
