Over the course of developing parboiled (PEG parsing engine for Java and Scala 
using internals DSLs for grammar specification) I found that it's best to 
clearly separate the following three things:
- grammar operators
- parse tree building
- parser action code

The operators, together with the rules they are applied to, define the language 
your grammar recognizes. Nothing more, nothing less.
Whether your engine builds a parse tree for input of that language or whether 
it doesn't should likely not be specified at the operator level, but it some 
other way.
In "parboiled for Java" operators and rules are modeled as method calls whereas 
parse tree building is controlled via annotations on those methods. You can 
selectively enable or disable the creation of parse tree nodes per grammar 
rule, which allows you to tweak parse tree building exactly to your needs.

The interface between the parsing engine and custom parser action code is the 
thing that has changed the most over the course of parboileds life cycle so far.
Initially parser action code had to access the parse tree nodes in order to get 
to matched input. Creating a custom object structure during the parsing run 
(e.g. an AST) was done by decorating the parse tree. So the parse tree was the 
"work bench" of the parsing process.
Over time this heavy centering around the parse tree turned out to have two 
problems:

1. Low Performance.
Always having to create a parse tree even when all you care about is the AST is 
just wasteful. Especially since the parse tree can contain a huge number of 
nodes for larger inputs (sometimes more nodes than input characters).

2. Less room for automatic optimizations.
The structure of the parse tree is dictated by the structure of the grammar 
rules. If your parser action code is built under the assumption that the parse 
tree has a given structure there is little leeway for the parsing engine to 
apply automatic rule optimizations. Decoupling the parser action code from the 
parse tree opens up the possibility to apply all kinds of automatic grammar 
tweaking before running the parser the first time. The engine might decide to 
completely change the rule structure, as long as all changes do not change the 
recognized language and are transparent to the parser action code.

Currently parboiled implements the interface between the parsing engine and the 
action code in the following way:
1. Actions can appear anywhere in a rule.
2. Actions can access the matched input text of the sub rule immediately 
preceding the action but not of any other rule (so there is no sub rule 
labeling required).
3. For working with custom objects (e.g. AST nodes) the engine provides a 
"Value Stack", which is a simple stack structure that serves as a fast work 
bench for a parsing run. Actions can push objects onto this stack, pop them 
off, swap them around, and so on.

This solution completely decouples the parse tree from everything else. You can 
enable or disable parse tree building without any effect on the rest of the 
parser. There is no need for addressing sub rules in action expressions and 
given a somewhat efficient value stack implementation the whole thing is quite 
fast. Additionally, in "parboiled for Scala", the action code with its 
manipulations of the value stack can be statically type-checked at compile 
time, which is a huge plus.

In case you are interested in more details or broader explanations, the 
parboiled documentation is quite complete.

Cheers,
Mathias

---
math...@parboiled.org
http://www.parboiled.org

On 09.12.2010, at 21:01, Alan Post wrote:

> I'm working on my PEG parser, in particular the interface between
> the parse tree and the code one can attach to productions that
> are executed on a successful parse.
> 
> I've arranged for the two predicate operations, & and !, to not add
> any output to the parse tree.  That means that the following
> production:
> 
>  rule <- &a !b "c"
> 
> Produces the same parse tree as:
> 
>  rule <- "c"
> 
> Internally, this means that I recognize that the sequence operator
> (which contains the productions '&a', '!b', and '"c"' in this
> example) is being called with predicates in every position but one,
> and rather than returning a list containing that single element,
> I return just the single element.
> 
> As I've been doing this, I've found that I want a new operator similar
> to '&'.  '&' matches the production it is attached to, but it does not
> advance the position of the input buffer.
> 
> I'd like an operator that matches the production it is attached to,
> advances the input buffer, but doesn't add anything to the parse
> tree.
> 
> Here's an example:
> 
>  mulexp <- digit '*' digit EOF -> {(lambda (x y) (* x y))}
> 
> the mulexp production is a sequence of four other rules, but only
> two of them are needed by the associated code.  It would be nice
> if I could write the code rule like it is above, rather than say
> this:
> 
>  (lambda (x op y EOF) (* x y))
> 
> Having to account for all the rules in the sequence, but really 
> only caring about two of them.  Here is the example rewritten
> with '^' expressing "match the rule, advance the input, but don't
> modify the parse tree":
> 
>  mulexp <- digit ^'*' digit ^EOF -> {(lambda (x y) (* x y))}
> 
> Before I go inventing syntax for this use case, will you tell me if
> this is already being done with other parsers?  Have any of you had
> this problem and already solved it, and if so, what approach did you
> take?
> 
> -Alan
> -- 
> .i ko djuno fi le do sevzi
> 
> _______________________________________________
> PEG mailing list
> PEG@lists.csail.mit.edu
> https://lists.csail.mit.edu/mailman/listinfo/peg


_______________________________________________
PEG mailing list
PEG@lists.csail.mit.edu
https://lists.csail.mit.edu/mailman/listinfo/peg

Reply via email to