Over the course of developing parboiled (PEG parsing engine for Java and Scala using internals DSLs for grammar specification) I found that it's best to clearly separate the following three things: - grammar operators - parse tree building - parser action code
The operators, together with the rules they are applied to, define the language your grammar recognizes. Nothing more, nothing less. Whether your engine builds a parse tree for input of that language or whether it doesn't should likely not be specified at the operator level, but it some other way. In "parboiled for Java" operators and rules are modeled as method calls whereas parse tree building is controlled via annotations on those methods. You can selectively enable or disable the creation of parse tree nodes per grammar rule, which allows you to tweak parse tree building exactly to your needs. The interface between the parsing engine and custom parser action code is the thing that has changed the most over the course of parboileds life cycle so far. Initially parser action code had to access the parse tree nodes in order to get to matched input. Creating a custom object structure during the parsing run (e.g. an AST) was done by decorating the parse tree. So the parse tree was the "work bench" of the parsing process. Over time this heavy centering around the parse tree turned out to have two problems: 1. Low Performance. Always having to create a parse tree even when all you care about is the AST is just wasteful. Especially since the parse tree can contain a huge number of nodes for larger inputs (sometimes more nodes than input characters). 2. Less room for automatic optimizations. The structure of the parse tree is dictated by the structure of the grammar rules. If your parser action code is built under the assumption that the parse tree has a given structure there is little leeway for the parsing engine to apply automatic rule optimizations. Decoupling the parser action code from the parse tree opens up the possibility to apply all kinds of automatic grammar tweaking before running the parser the first time. The engine might decide to completely change the rule structure, as long as all changes do not change the recognized language and are transparent to the parser action code. Currently parboiled implements the interface between the parsing engine and the action code in the following way: 1. Actions can appear anywhere in a rule. 2. Actions can access the matched input text of the sub rule immediately preceding the action but not of any other rule (so there is no sub rule labeling required). 3. For working with custom objects (e.g. AST nodes) the engine provides a "Value Stack", which is a simple stack structure that serves as a fast work bench for a parsing run. Actions can push objects onto this stack, pop them off, swap them around, and so on. This solution completely decouples the parse tree from everything else. You can enable or disable parse tree building without any effect on the rest of the parser. There is no need for addressing sub rules in action expressions and given a somewhat efficient value stack implementation the whole thing is quite fast. Additionally, in "parboiled for Scala", the action code with its manipulations of the value stack can be statically type-checked at compile time, which is a huge plus. In case you are interested in more details or broader explanations, the parboiled documentation is quite complete. Cheers, Mathias --- math...@parboiled.org http://www.parboiled.org On 09.12.2010, at 21:01, Alan Post wrote: > I'm working on my PEG parser, in particular the interface between > the parse tree and the code one can attach to productions that > are executed on a successful parse. > > I've arranged for the two predicate operations, & and !, to not add > any output to the parse tree. That means that the following > production: > > rule <- &a !b "c" > > Produces the same parse tree as: > > rule <- "c" > > Internally, this means that I recognize that the sequence operator > (which contains the productions '&a', '!b', and '"c"' in this > example) is being called with predicates in every position but one, > and rather than returning a list containing that single element, > I return just the single element. > > As I've been doing this, I've found that I want a new operator similar > to '&'. '&' matches the production it is attached to, but it does not > advance the position of the input buffer. > > I'd like an operator that matches the production it is attached to, > advances the input buffer, but doesn't add anything to the parse > tree. > > Here's an example: > > mulexp <- digit '*' digit EOF -> {(lambda (x y) (* x y))} > > the mulexp production is a sequence of four other rules, but only > two of them are needed by the associated code. It would be nice > if I could write the code rule like it is above, rather than say > this: > > (lambda (x op y EOF) (* x y)) > > Having to account for all the rules in the sequence, but really > only caring about two of them. Here is the example rewritten > with '^' expressing "match the rule, advance the input, but don't > modify the parse tree": > > mulexp <- digit ^'*' digit ^EOF -> {(lambda (x y) (* x y))} > > Before I go inventing syntax for this use case, will you tell me if > this is already being done with other parsers? Have any of you had > this problem and already solved it, and if so, what approach did you > take? > > -Alan > -- > .i ko djuno fi le do sevzi > > _______________________________________________ > PEG mailing list > PEG@lists.csail.mit.edu > https://lists.csail.mit.edu/mailman/listinfo/peg _______________________________________________ PEG mailing list PEG@lists.csail.mit.edu https://lists.csail.mit.edu/mailman/listinfo/peg