Re: [PEG] and-predicate with slightly different behavior?

Alan Post Thu, 09 Dec 2010 14:58:38 -0800

Will you help me understand something about parboiled?

When I look at this code:


  https://github.com/sirthias/pegdown/blob/master/src/org/pegdown/Parser.java

Am I looking at something hand-written, or am I looking at something
compiled from a grammar definition?

I've been influenced by your assertion that I should separate code
and grammar.  My specific need is to automatically generate a parser
from a PEG description, which I have assumed means I need to attach
actions to the grammar, but that I should do so in a way that
doesn't tightly couple the action code to the parse tree.  Is this
something that parboiled does?  And if so, may I see an example of
the grammar that is used to compile a parser?

Thank you,

-Alan

On Thu, Dec 09, 2010 at 11:44:44PM +0100, Mathias wrote:
> Over the course of developing parboiled (PEG parsing engine for Java and 
> Scala using internals DSLs for grammar specification) I found that it's best 
> to clearly separate the following three things:
> - grammar operators
> - parse tree building
> - parser action code
> 
> The operators, together with the rules they are applied to, define the 
> language your grammar recognizes. Nothing more, nothing less.
> Whether your engine builds a parse tree for input of that language or whether 
> it doesn't should likely not be specified at the operator level, but it some 
> other way.
> In "parboiled for Java" operators and rules are modeled as method calls 
> whereas parse tree building is controlled via annotations on those methods. 
> You can selectively enable or disable the creation of parse tree nodes per 
> grammar rule, which allows you to tweak parse tree building exactly to your 
> needs.
> 
> The interface between the parsing engine and custom parser action code is the 
> thing that has changed the most over the course of parboileds life cycle so 
> far.
> Initially parser action code had to access the parse tree nodes in order to 
> get to matched input. Creating a custom object structure during the parsing 
> run (e.g. an AST) was done by decorating the parse tree. So the parse tree 
> was the "work bench" of the parsing process.
> Over time this heavy centering around the parse tree turned out to have two 
> problems:
> 
> 1. Low Performance.
> Always having to create a parse tree even when all you care about is the AST 
> is just wasteful. Especially since the parse tree can contain a huge number 
> of nodes for larger inputs (sometimes more nodes than input characters).
> 
> 2. Less room for automatic optimizations.
> The structure of the parse tree is dictated by the structure of the grammar 
> rules. If your parser action code is built under the assumption that the 
> parse tree has a given structure there is little leeway for the parsing 
> engine to apply automatic rule optimizations. Decoupling the parser action 
> code from the parse tree opens up the possibility to apply all kinds of 
> automatic grammar tweaking before running the parser the first time. The 
> engine might decide to completely change the rule structure, as long as all 
> changes do not change the recognized language and are transparent to the 
> parser action code.
> 
> Currently parboiled implements the interface between the parsing engine and 
> the action code in the following way:
> 1. Actions can appear anywhere in a rule.
> 2. Actions can access the matched input text of the sub rule immediately 
> preceding the action but not of any other rule (so there is no sub rule 
> labeling required).
> 3. For working with custom objects (e.g. AST nodes) the engine provides a 
> "Value Stack", which is a simple stack structure that serves as a fast work 
> bench for a parsing run. Actions can push objects onto this stack, pop them 
> off, swap them around, and so on.
> 
> This solution completely decouples the parse tree from everything else. You 
> can enable or disable parse tree building without any effect on the rest of 
> the parser. There is no need for addressing sub rules in action expressions 
> and given a somewhat efficient value stack implementation the whole thing is 
> quite fast. Additionally, in "parboiled for Scala", the action code with its 
> manipulations of the value stack can be statically type-checked at compile 
> time, which is a huge plus.
> 
> In case you are interested in more details or broader explanations, the 
> parboiled documentation is quite complete.
> 
> Cheers,
> Mathias
> 
> ---
> math...@parboiled.org
> http://www.parboiled.org
> 
> On 09.12.2010, at 21:01, Alan Post wrote:
> 
> > I'm working on my PEG parser, in particular the interface between
> > the parse tree and the code one can attach to productions that
> > are executed on a successful parse.
> > 
> > I've arranged for the two predicate operations, & and !, to not add
> > any output to the parse tree.  That means that the following
> > production:
> > 
> >  rule <- &a !b "c"
> > 
> > Produces the same parse tree as:
> > 
> >  rule <- "c"
> > 
> > Internally, this means that I recognize that the sequence operator
> > (which contains the productions '&a', '!b', and '"c"' in this
> > example) is being called with predicates in every position but one,
> > and rather than returning a list containing that single element,
> > I return just the single element.
> > 
> > As I've been doing this, I've found that I want a new operator similar
> > to '&'.  '&' matches the production it is attached to, but it does not
> > advance the position of the input buffer.
> > 
> > I'd like an operator that matches the production it is attached to,
> > advances the input buffer, but doesn't add anything to the parse
> > tree.
> > 
> > Here's an example:
> > 
> >  mulexp <- digit '*' digit EOF -> {(lambda (x y) (* x y))}
> > 
> > the mulexp production is a sequence of four other rules, but only
> > two of them are needed by the associated code.  It would be nice
> > if I could write the code rule like it is above, rather than say
> > this:
> > 
> >  (lambda (x op y EOF) (* x y))
> > 
> > Having to account for all the rules in the sequence, but really 
> > only caring about two of them.  Here is the example rewritten
> > with '^' expressing "match the rule, advance the input, but don't
> > modify the parse tree":
> > 
> >  mulexp <- digit ^'*' digit ^EOF -> {(lambda (x y) (* x y))}
> > 
> > Before I go inventing syntax for this use case, will you tell me if
> > this is already being done with other parsers?  Have any of you had
> > this problem and already solved it, and if so, what approach did you
> > take?
> > 
> > -Alan
> > -- 
> > .i ko djuno fi le do sevzi
> > 
> > _______________________________________________
> > PEG mailing list
> > PEG@lists.csail.mit.edu
> > https://lists.csail.mit.edu/mailman/listinfo/peg
> 

-- 
.i ko djuno fi le do sevzi

_______________________________________________
PEG mailing list
PEG@lists.csail.mit.edu
https://lists.csail.mit.edu/mailman/listinfo/peg

Re: [PEG] and-predicate with slightly different behavior?

Reply via email to