Mathias: A good plan to separate out the:
- grammar operators - parse tree building - parser action code I agree with you, but in my case I want to have the grammar language totally independent of the implementation programming language. So the grammar has no parser action code, although this can be added later in a particular programming language implementation. Parboiled has much closer integration with the programming language, which has advantages, but I really want grammar specifications that are totally independent. So now for the grammar operators, and parse tree building annotations. For many years I kept these separate: tree building annotations in the rule head (ie they annotate the rule name) or definition syntax (= for interior nodes, : for leaf nodes or terminals). The right hand side of the rule had the grammar expression body withs grammar operations only. One more head notation allows pruning the tree. This works fine, and maybe I should have stuck to that separation, but I introduced the `x prefix notation because I found that it was an advantage to be able to see the children nodes that would be generated by looking at the parent rule alone, and not having to refer to the children rule definitions to see how they are annotated. I have found that about half of many grammars turn out to be leaf nodes, so the ":" rule definition notation has a huge payoff in tree size and performance. Because of this the `x rule is not vital, but over time I found it was convenient, and a good trade-off for me. Cheers, Peter. On Fri, Dec 10, 2010 at 11:44 AM, Mathias <math...@parboiled.org> wrote: > Over the course of developing parboiled (PEG parsing engine for Java and > Scala using internals DSLs for grammar specification) I found that it's best > to clearly separate the following three things: > - grammar operators > - parse tree building > - parser action code > > The operators, together with the rules they are applied to, define the > language your grammar recognizes. Nothing more, nothing less. > Whether your engine builds a parse tree for input of that language or > whether it doesn't should likely not be specified at the operator level, but > it some other way. > In "parboiled for Java" operators and rules are modeled as method calls > whereas parse tree building is controlled via annotations on those methods. > You can selectively enable or disable the creation of parse tree nodes per > grammar rule, which allows you to tweak parse tree building exactly to your > needs. > > The interface between the parsing engine and custom parser action code is > the thing that has changed the most over the course of parboileds life cycle > so far. > Initially parser action code had to access the parse tree nodes in order to > get to matched input. Creating a custom object structure during the parsing > run (e.g. an AST) was done by decorating the parse tree. So the parse tree > was the "work bench" of the parsing process. > Over time this heavy centering around the parse tree turned out to have two > problems: > > 1. Low Performance. > Always having to create a parse tree even when all you care about is the > AST is just wasteful. Especially since the parse tree can contain a huge > number of nodes for larger inputs (sometimes more nodes than input > characters). > > 2. Less room for automatic optimizations. > The structure of the parse tree is dictated by the structure of the grammar > rules. If your parser action code is built under the assumption that the > parse tree has a given structure there is little leeway for the parsing > engine to apply automatic rule optimizations. Decoupling the parser action > code from the parse tree opens up the possibility to apply all kinds of > automatic grammar tweaking before running the parser the first time. The > engine might decide to completely change the rule structure, as long as all > changes do not change the recognized language and are transparent to the > parser action code. > > Currently parboiled implements the interface between the parsing engine and > the action code in the following way: > 1. Actions can appear anywhere in a rule. > 2. Actions can access the matched input text of the sub rule immediately > preceding the action but not of any other rule (so there is no sub rule > labeling required). > 3. For working with custom objects (e.g. AST nodes) the engine provides a > "Value Stack", which is a simple stack structure that serves as a fast work > bench for a parsing run. Actions can push objects onto this stack, pop them > off, swap them around, and so on. > > This solution completely decouples the parse tree from everything else. You > can enable or disable parse tree building without any effect on the rest of > the parser. There is no need for addressing sub rules in action expressions > and given a somewhat efficient value stack implementation the whole thing is > quite fast. Additionally, in "parboiled for Scala", the action code with its > manipulations of the value stack can be statically type-checked at compile > time, which is a huge plus. > > In case you are interested in more details or broader explanations, the > parboiled documentation is quite complete. > > Cheers, > Mathias > > --- > math...@parboiled.org > http://www.parboiled.org > > On 09.12.2010, at 21:01, Alan Post wrote: > > > I'm working on my PEG parser, in particular the interface between > > the parse tree and the code one can attach to productions that > > are executed on a successful parse. > > > > I've arranged for the two predicate operations, & and !, to not add > > any output to the parse tree. That means that the following > > production: > > > > rule <- &a !b "c" > > > > Produces the same parse tree as: > > > > rule <- "c" > > > > Internally, this means that I recognize that the sequence operator > > (which contains the productions '&a', '!b', and '"c"' in this > > example) is being called with predicates in every position but one, > > and rather than returning a list containing that single element, > > I return just the single element. > > > > As I've been doing this, I've found that I want a new operator similar > > to '&'. '&' matches the production it is attached to, but it does not > > advance the position of the input buffer. > > > > I'd like an operator that matches the production it is attached to, > > advances the input buffer, but doesn't add anything to the parse > > tree. > > > > Here's an example: > > > > mulexp <- digit '*' digit EOF -> {(lambda (x y) (* x y))} > > > > the mulexp production is a sequence of four other rules, but only > > two of them are needed by the associated code. It would be nice > > if I could write the code rule like it is above, rather than say > > this: > > > > (lambda (x op y EOF) (* x y)) > > > > Having to account for all the rules in the sequence, but really > > only caring about two of them. Here is the example rewritten > > with '^' expressing "match the rule, advance the input, but don't > > modify the parse tree": > > > > mulexp <- digit ^'*' digit ^EOF -> {(lambda (x y) (* x y))} > > > > Before I go inventing syntax for this use case, will you tell me if > > this is already being done with other parsers? Have any of you had > > this problem and already solved it, and if so, what approach did you > > take? > > > > -Alan > > -- > > .i ko djuno fi le do sevzi > > > > _______________________________________________ > > PEG mailing list > > PEG@lists.csail.mit.edu > > https://lists.csail.mit.edu/mailman/listinfo/peg > > > _______________________________________________ > PEG mailing list > PEG@lists.csail.mit.edu > https://lists.csail.mit.edu/mailman/listinfo/peg >
_______________________________________________ PEG mailing list PEG@lists.csail.mit.edu https://lists.csail.mit.edu/mailman/listinfo/peg