Peter Goodman wrote:
>Given the grammar:
>
>Atom
>    : -<number>    /* note: this is a non-excludable */
>    ;
>
>Additive
>    : ^Additive '+' Atom
>    : Atom
>    ;
>
>And the sequence of tokens: 1 + 2 + 3, the untransformed parse tree
>appears as:
>
>                   Additive
>                  /    |   \
>           Additive    +    Atom
>          /    |   \          |
>   Additive    +    Atom      3
>       |              |
>     Atom             2
>       |
>       1
>
>The automatic reductions raise up the numbers in place of the Atom
>as follows, and the '+' tokens are dropped as they are excludable:
>
>                   Additive
>                  /        \
>           Additive         3
>          /        \
>   Additive         2
>       |
>       1
>
>And finally the ^ raises the children of the two lower Additives up
>into the root Additive:
>
>   Additive
>   |  |  |
>   1  2  3

I’ve recently spent a bit of time myself thinking about convenient ways to
specify and expose the AST, so I’ve found this entire thread to be quite
interesting. But let me use just the above as a working example.

I’ve been developing APG – an ABNF Parse Generator – off and on over the
past few years. It is, I think, equivalent to a PEG parser even though it
uses the ABNF (RFC 4234) grammar form. All of the ABNF terms are represented
by operators which recognize phrases (recognition-based) and additionally I
have adopted the prioritized-choice disambiguation rule and syntactic
predicates as defined by PEG.

The most recent changes I have made address many of the problems discussed
here. The rule above might be written as
Additive = Atom *(Operator Atom)
Operator = "+" / "-"
Atom = 1*digit
digit = "0" / "1" / ... /"9"

The full parse tree for "1+2+3" would be

                   Additive
                   (1+2+3)
                      |
                   cat-op
                   (1+2+3)
                      |
      _________________________________
      |                               |
      |                           star-op
      |                           (+2+3)
      |                               |
      |                     _____________________
      |                     |                    |
      |                  cat-op               cat-op
      |                   (+2)                 (+3)
      |                     |                    |
      |                  __________          __________
      |                  |        |          |        |
    Atom               Operator  Atom       Operator  Atom
     (1)                 (+)     (2)          (+)      (3)
      |                   |       |            |        |
    digit                 +     digit          +      digit
     (1)                         (2)                   (3)
      |                           |                     |
      1                           2                     3


The parentheses indicate the phrases recognized by the individual node
operators. I have approached the AST in the following way.

1. I have removed all AST (and semantic action) information from the
grammar. The grammar is a pure statement of the language, nothing more and
nothing less.
2. The nodes to appear on the AST are specified at run time just before the
parse. A special function ulParseAstInit( list) is called with list being a
list of true/false values, one for each rule name (its even simpler in
practice, but the details are unnecessary here.)
3. (Side Bar: all semantic actions are defined at run time in a similar
fashion.)
4. Because each AST node captures the collected, concatenated phrase that it
matches, there is no need for keeping any interior or terminal nodes.
5. Suppose we specify Additive, Atom and Operator in the ulParseAstInit
list. The AST would look conceptually like this:

                   Additive
                   (1+2+3)
                      |
      _________________________________
      |       |       |       |       |
    Atom  Operator  Atom  Operator  Atom
     (1)     (+)     (2)     (+)     (3)

Actually, it is generated as a flat-file list of nodes like this:
Additive(down, "1+2+3")
  Atom(down, "1")
  Atom(up, "1")
  Operator(down, "+")
  Operator(up, "+")
  Atom(down, "2")
  Atom(up, "2")
  Operator(down, "+")
  Operator(up, "+")
  Atom(down, "2")
  Atom(up, "2")
Additive(up, "1+2+3")

A sequential pass through the flat file is identical to a depth-first
traversal of the AST.

However, many applications need a simple API to access the parsed phrases.
To this end I have also developed a "Tree Node Model" (very, very loosely
patterned after the HTML Document Object Model) which provides more direct
node access.

spNode = spTnmRule(rulename, iterator)
Returns an iterator over the list of all "rulename" AST nodes, spNode being
the first node in the list.

spNode = spTnmSiblings(spAnyNode, iterator)
Returns an iterator over all siblings of the given spAnyNode.

spNode = spTnmChildren(spAnyNode, iterator)
Returns an iterator over all children of the given spAnyNode.

There is more, but those are the basics. I’ve found the TNM to be convenient
for building an API to be used by developers who are not, and don’t want to
be, parsing experts.

Michaeljohn Clement wrote:
>It's also helpful in debugging a grammar to exclude nothing
>to see everything that is being tried without needing to alter
>the grammar.  For creating a human-readable parse tree, on the other
>hand, I usually want it to drop everything it possibly can.

Speaking of debugging – this is another area I’ve put some effort into on my
latest version. For debugging I do a complete, real-time print out of the
entire syntax tree traversal, as it happens. This includes all states
(pre-branch, post-branch – match, empty, nomatch) of all nodes (interior,
non-terminal and terminal.) This includes even those alternate branches that
fail and never show up in the parse tree. The problem with this is that it
can run into hundreds of thousands, even millions of lines of output.
Finding the bug is like looking for a needle in a haystack. For my latest
version I have provided very fine-grained controls over exactly which lines
get printed. You can control very precisely which states and nodes get
printed right down to specifying by name which non-terminal nodes to print.
You can also control the range of line numbers – print only the last 1000,
for example. With a little practice and a few cycles it is pretty easy to
quickly zero in on the offending node and/or input string phrase that is
failing.

If you want to take a look at how this all works it is now available for
free (GPL) as Version 6.0 from my web site.
(http://www.coasttocoastresearch.com/apg-6.0/description-60.php)

Thanks,
Lowell Thomas
low...@coasttocoastresearch.com
www.coasttocoastresearch.com




_______________________________________________
PEG mailing list
PEG@lists.csail.mit.edu
https://lists.csail.mit.edu/mailman/listinfo/peg

Reply via email to