Peter Goodman wrote:
>Given the grammar:
>
>Atom
>  : -<number>   /* note: this is a non-excludable */
>  ;
>
>Additive
>  : ^Additive '+' Atom
>  : Atom
>  ;
>
>And the sequence of tokens: 1 + 2 + 3, the untransformed parse tree
>appears as:
>
>               Additive
>              /    |    \
>       Additive    +    Atom
>       /   |   \          |
> Additive  +   Atom       3
>     |           |
>    Atom         2
>     |
>     1
>
>The automatic reductions raise up the numbers in place of the Atom
>as follows, and the '+' tokens are dropped as they are excludable:
>
>          Additive
>          /      \
>     Additive     3
>     /      \
> Additive    2
>     |
>     1
>
>And finally the ^ raises the children of the two lower Additives up
>into the root Additive:
>
>   Additive
>   |  |  |
>   1  2  3
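Just to be sure I'm reading those two reductions correctly, here is how I would sketch them in C. This is purely illustrative and not Peter's implementation: the Node structure and the keep/raise flags are my own shorthand for the '-' and '^' annotations, the step that promotes the number in place of the Atom is left out, and it's only the reduction routine, not a complete program.

/* Node: either a token leaf or a nonterminal with children.  The flags
 * mimic the annotations above: "keep" stands for '-' (non-excludable
 * token) and "raise" for '^' (splice this node's children into the
 * parent in its place). */
#include <stdlib.h>

typedef struct Node Node;
struct Node {
    const char *label;   /* rule name or token text                */
    int is_token;        /* token leaf?                            */
    int keep;            /* '-' : token survives the reduction     */
    int raise;           /* '^' : children replace this node       */
    Node **kids;
    size_t nkids;
};

/* append helper for the rebuilt child list */
static void push(Node ***out, size_t *len, size_t *cap, Node *k)
{
    if (*len == *cap)
        *out = realloc(*out, (*cap = *cap ? *cap * 2 : 8) * sizeof **out);
    (*out)[(*len)++] = k;
}

/* Bottom-up: drop excludable tokens, splice '^' children into the parent. */
static void reduce(Node *n)
{
    Node **out = NULL;
    size_t len = 0, cap = 0;

    for (size_t i = 0; i < n->nkids; i++) {
        Node *k = n->kids[i];
        reduce(k);                      /* reduce the subtree first        */
        if (k->is_token && !k->keep)    /* e.g. the '+' tokens             */
            continue;                   /* (dropped node leaks; fine here) */
        if (k->raise) {                 /* e.g. the ^Additive child        */
            for (size_t j = 0; j < k->nkids; j++)
                push(&out, &len, &cap, k->kids[j]);
        } else {
            push(&out, &len, &cap, k);
        }
    }
    free(n->kids);
    n->kids = out;
    n->nkids = len;
}

The point is just that dropping excludable tokens and splicing '^' children bottom-up produces the flat three-child Additive shown above.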
I've recently spent a bit of time myself thinking about convenient ways to specify and expose the AST, so I've found this entire thread to be quite interesting. But let me use just the above as a working example.

I've been developing APG, an ABNF Parser Generator, off and on over the past few years. It is, I think, equivalent to a PEG parser even though it uses the ABNF (RFC 4234) grammar form. All of the ABNF terms are represented by operators which recognize phrases (recognition-based), and additionally I have adopted the prioritized-choice disambiguation rule and syntactic predicates as defined by PEG. The most recent changes I have made address many of the problems discussed here.

The rules above might be written as

  Additive = Atom *(Operator Atom)
  Operator = "+" / "-"
  Atom     = 1*digit
  digit    = "0" / "1" / ... / "9"

The full parse tree for "1+2+3" would be

           Additive (1+2+3)
               |
            cat-op (1+2+3)
               |
   _________________________
   |                      |
   |                   star-op
   |                   (+2+3)
   |                      |
   |         _____________________
   |             |                   |
   |          cat-op              cat-op
   |           (+2)                (+3)
   |             |                   |
   |        ___________         ___________
   |        |         |         |         |
  Atom  Operator    Atom    Operator    Atom
  (1)      (+)       (2)       (+)       (3)
   |        |         |         |         |
 digit      +       digit      +       digit
  (1)                (2)                 (3)
   |                  |                   |
   1                  2                   3

The parentheses indicate the phrases recognized by the individual node operators.

I have approached the AST in the following way.

1. I have removed all AST (and semantic action) information from the grammar. The grammar is a pure statement of the language, nothing more and nothing less.

2. The nodes to appear on the AST are specified at run time, just before the parse. A special function ulParseAstInit(list) is called, with list being a list of true/false values, one for each rule name. (It's even simpler in practice, but the details are unnecessary here.)

3. (Side bar: all semantic actions are defined at run time in a similar fashion.)

4. Because each AST node captures the collected, concatenated phrase that it matches, there is no need for keeping any interior or terminal nodes.

5. Suppose we specify Additive, Atom and Operator in the ulParseAstInit list. The AST would look conceptually like this:

                   Additive (1+2+3)
                      |
   ________________________________________
   |        |         |         |         |
  Atom  Operator    Atom    Operator    Atom
  (1)      (+)       (2)       (+)       (3)

Actually, it is generated as a flat-file list of nodes like this:

  Additive(down, "1+2+3")
  Atom(down, "1")
  Atom(up, "1")
  Operator(down, "+")
  Operator(up, "+")
  Atom(down, "2")
  Atom(up, "2")
  Operator(down, "+")
  Operator(up, "+")
  Atom(down, "3")
  Atom(up, "3")
  Additive(up, "1+2+3")

A sequential pass through the flat file is identical to a depth-first traversal of the AST. However, many applications need a simple API to access the parsed phrases. To this end I have also developed a "Tree Node Model" (very, very loosely patterned after the HTML Document Object Model) which provides more direct node access.

  spNode = spTnmRule(rulename, iterator)
      Returns an iterator over the list of all "rulename" AST nodes,
      spNode being the first node in the list.

  spNode = spTnmSiblings(spAnyNode, iterator)
      Returns an iterator over all siblings of the given spAnyNode.

  spNode = spTnmChildren(spAnyNode, iterator)
      Returns an iterator over all children of the given spAnyNode.

There is more, but those are the basics. I've found the TNM to be convenient for building an API to be used by developers who are not, and don't want to be, parsing experts.
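To illustrate item 4 above, that a sequential pass over the flat file is the same as a depth-first traversal, here is a small self-contained C sketch. The AstRecord type and the hand-written record list are purely illustrative, not APG's internal types or API; the "visit" is just an indented print of each node as its "down" record is encountered.

/* A sequential pass over (rule, down/up, phrase) records reproduces a
 * depth-first traversal of the AST: "down" enters a node, "up" leaves it. */
#include <stdio.h>

typedef enum { DOWN, UP } Dir;

typedef struct {
    const char *rule;    /* rule name, e.g. "Additive"      */
    Dir         dir;     /* entering (down) or leaving (up) */
    const char *phrase;  /* matched phrase for this node    */
} AstRecord;

/* the example AST for "1+2+3", written out by hand */
static const AstRecord ast[] = {
    { "Additive", DOWN, "1+2+3" },
    { "Atom",     DOWN, "1"     }, { "Atom",     UP, "1" },
    { "Operator", DOWN, "+"     }, { "Operator", UP, "+" },
    { "Atom",     DOWN, "2"     }, { "Atom",     UP, "2" },
    { "Operator", DOWN, "+"     }, { "Operator", UP, "+" },
    { "Atom",     DOWN, "3"     }, { "Atom",     UP, "3" },
    { "Additive", UP,   "1+2+3" },
};

int main(void)
{
    int depth = 0;
    for (size_t i = 0; i < sizeof ast / sizeof ast[0]; i++) {
        if (ast[i].dir == DOWN) {
            printf("%*s%s(%s)\n", 2 * depth, "", ast[i].rule, ast[i].phrase);
            depth++;            /* descend into this node's children    */
        } else {
            depth--;            /* matching "up" record closes the node */
        }
    }
    return 0;
}

Run against the "1+2+3" records, this prints Additive(1+2+3) at depth zero with the Atom and Operator nodes indented one level beneath it, which is exactly the conceptual tree shown above.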
Michaeljohn Clement wrote:
>It's also helpful in debugging a grammar to exclude nothing
>to see everything that is being tried without needing to alter
>the grammar. For creating a human-readable parse tree, on the other
>hand, I usually want it to drop everything it possibly can.

Speaking of debugging, this is another area I've put some effort into in my latest version. For debugging I do a complete, real-time printout of the entire syntax tree traversal, as it happens. This includes all states (pre-branch, post-branch match, empty, nomatch) of all nodes (interior, non-terminal and terminal), including even those alternate branches that fail and never show up in the parse tree. The problem with this is that it can run into hundreds of thousands, even millions, of lines of output. Finding the bug is like looking for a needle in a haystack.

For my latest version I have provided very fine-grained controls over exactly which lines get printed. You can control very precisely which states and nodes get printed, right down to specifying by name which non-terminal nodes to print. You can also control the range of line numbers, printing only the last 1000 lines, for example. With a little practice and a few cycles it is pretty easy to zero in quickly on the offending node and/or input string phrase that is failing.

If you want to take a look at how this all works, it is now available for free (GPL) as Version 6.0 from my web site.
(http://www.coasttocoastresearch.com/apg-6.0/description-60.php)

Thanks,
Lowell Thomas
low...@coasttocoastresearch.com
www.coasttocoastresearch.com

_______________________________________________
PEG mailing list
PEG@lists.csail.mit.edu
https://lists.csail.mit.edu/mailman/listinfo/peg