Will you help me understand something about parboiled? When I look at this code:
https://github.com/sirthias/pegdown/blob/master/src/org/pegdown/Parser.java Am I looking at something hand-written, or am I looking at something compiled from a grammar definition? I've been influenced by your assertion that I should separate code and grammar. My specific need is to automatically generate a parser from a PEG description, which I have assumed means I need to attach actions to the grammar, but that I should do so in a way that doesn't tightly couple the action code to the parse tree. Is this something that parboiled does? And if so, may I see an example of the grammar that is used to compile a parser? Thank you, -Alan On Thu, Dec 09, 2010 at 11:44:44PM +0100, Mathias wrote: > Over the course of developing parboiled (PEG parsing engine for Java and > Scala using internals DSLs for grammar specification) I found that it's best > to clearly separate the following three things: > - grammar operators > - parse tree building > - parser action code > > The operators, together with the rules they are applied to, define the > language your grammar recognizes. Nothing more, nothing less. > Whether your engine builds a parse tree for input of that language or whether > it doesn't should likely not be specified at the operator level, but it some > other way. > In "parboiled for Java" operators and rules are modeled as method calls > whereas parse tree building is controlled via annotations on those methods. > You can selectively enable or disable the creation of parse tree nodes per > grammar rule, which allows you to tweak parse tree building exactly to your > needs. > > The interface between the parsing engine and custom parser action code is the > thing that has changed the most over the course of parboileds life cycle so > far. > Initially parser action code had to access the parse tree nodes in order to > get to matched input. Creating a custom object structure during the parsing > run (e.g. an AST) was done by decorating the parse tree. So the parse tree > was the "work bench" of the parsing process. > Over time this heavy centering around the parse tree turned out to have two > problems: > > 1. Low Performance. > Always having to create a parse tree even when all you care about is the AST > is just wasteful. Especially since the parse tree can contain a huge number > of nodes for larger inputs (sometimes more nodes than input characters). > > 2. Less room for automatic optimizations. > The structure of the parse tree is dictated by the structure of the grammar > rules. If your parser action code is built under the assumption that the > parse tree has a given structure there is little leeway for the parsing > engine to apply automatic rule optimizations. Decoupling the parser action > code from the parse tree opens up the possibility to apply all kinds of > automatic grammar tweaking before running the parser the first time. The > engine might decide to completely change the rule structure, as long as all > changes do not change the recognized language and are transparent to the > parser action code. > > Currently parboiled implements the interface between the parsing engine and > the action code in the following way: > 1. Actions can appear anywhere in a rule. > 2. Actions can access the matched input text of the sub rule immediately > preceding the action but not of any other rule (so there is no sub rule > labeling required). > 3. For working with custom objects (e.g. AST nodes) the engine provides a > "Value Stack", which is a simple stack structure that serves as a fast work > bench for a parsing run. Actions can push objects onto this stack, pop them > off, swap them around, and so on. > > This solution completely decouples the parse tree from everything else. You > can enable or disable parse tree building without any effect on the rest of > the parser. There is no need for addressing sub rules in action expressions > and given a somewhat efficient value stack implementation the whole thing is > quite fast. Additionally, in "parboiled for Scala", the action code with its > manipulations of the value stack can be statically type-checked at compile > time, which is a huge plus. > > In case you are interested in more details or broader explanations, the > parboiled documentation is quite complete. > > Cheers, > Mathias > > --- > math...@parboiled.org > http://www.parboiled.org > > On 09.12.2010, at 21:01, Alan Post wrote: > > > I'm working on my PEG parser, in particular the interface between > > the parse tree and the code one can attach to productions that > > are executed on a successful parse. > > > > I've arranged for the two predicate operations, & and !, to not add > > any output to the parse tree. That means that the following > > production: > > > > rule <- &a !b "c" > > > > Produces the same parse tree as: > > > > rule <- "c" > > > > Internally, this means that I recognize that the sequence operator > > (which contains the productions '&a', '!b', and '"c"' in this > > example) is being called with predicates in every position but one, > > and rather than returning a list containing that single element, > > I return just the single element. > > > > As I've been doing this, I've found that I want a new operator similar > > to '&'. '&' matches the production it is attached to, but it does not > > advance the position of the input buffer. > > > > I'd like an operator that matches the production it is attached to, > > advances the input buffer, but doesn't add anything to the parse > > tree. > > > > Here's an example: > > > > mulexp <- digit '*' digit EOF -> {(lambda (x y) (* x y))} > > > > the mulexp production is a sequence of four other rules, but only > > two of them are needed by the associated code. It would be nice > > if I could write the code rule like it is above, rather than say > > this: > > > > (lambda (x op y EOF) (* x y)) > > > > Having to account for all the rules in the sequence, but really > > only caring about two of them. Here is the example rewritten > > with '^' expressing "match the rule, advance the input, but don't > > modify the parse tree": > > > > mulexp <- digit ^'*' digit ^EOF -> {(lambda (x y) (* x y))} > > > > Before I go inventing syntax for this use case, will you tell me if > > this is already being done with other parsers? Have any of you had > > this problem and already solved it, and if so, what approach did you > > take? > > > > -Alan > > -- > > .i ko djuno fi le do sevzi > > > > _______________________________________________ > > PEG mailing list > > PEG@lists.csail.mit.edu > > https://lists.csail.mit.edu/mailman/listinfo/peg > -- .i ko djuno fi le do sevzi _______________________________________________ PEG mailing list PEG@lists.csail.mit.edu https://lists.csail.mit.edu/mailman/listinfo/peg