Hi all,

I've been thinking about a way of implementing a PEG parser over the
last couple of days for a project I'm working on, and I wanted to ask
if people thought it was a crazy idea. I'm looking into using PEGs to
develop parsers for a Java IDE, and since I have to integrate with
the IDE's existing infrastructure I have some unusual restrictions. I
have to use a classical lexer, so I'd be using PEGs to parse tokens
rather than characters, which actually seems to remove a few of the
standard problems with PEGs (although it probably complicates grammar
composition). Since this will be used in an interactive editor, error
recovery is very important - in fact, the IDE's developers recommend
hand-written parsers for exactly this reason.
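
To make the setup concrete, the parser would sit on something like
this - TokenKind, Token and TokenSource are just placeholder names
for whatever the IDE's lexer actually produces. Since the alphabet is
token kinds rather than characters, whitespace and comments never
reach the parser at all:

    enum TokenKind { IDENT, NUMBER, LPAREN, RPAREN, PLUS, EOF }

    record Token(TokenKind kind, String text, int offset) {}

    interface TokenSource {
        Token get(int index);   // indexed access, so several in-flight
        int length();           // matches can share one token buffer
    }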

So my idea would be to implement the parser using a Thompson-style NFA
instead of the normal recursive descent. This seems like a reasonable
choice, since in many "normal" grammars large parts are regular
(there is a paper by Krzemien and Lukasiewicz describing an algorithm
for identifying the regular parts, but I can't find it online).
Regular rules could be inlined so that the remaining productions are
only those that need something more powerful (recursion or ordered
choice). This is similar to Ragel, where a large regular grammar is
compiled into a single large DFA with actions on transitions.
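
Here's roughly the shape I have in mind - the standard Thompson
construction, except that edges are labelled with token kinds rather
than characters. This is only a sketch with made-up names; ordered
choice and predicates are deliberately left out at this level (more
on those below):

    class TokenNfa {
        static final int EPSILON = -1;

        // A state has at most two outgoing edges. If tokenKind !=
        // EPSILON, the edge to 'next' consumes one token of that
        // kind; otherwise 'next' and 'alt' are epsilon edges.
        static class State {
            int tokenKind = EPSILON;
            State next, alt;
        }

        // A machine fragment with a single dangling exit state.
        static class Frag {
            final State start, out;
            Frag(State start, State out) {
                this.start = start; this.out = out;
            }
        }

        static Frag token(int kind) {       // consume one token
            State s = new State(), e = new State();
            s.tokenKind = kind;
            s.next = e;
            return new Frag(s, e);
        }

        static Frag seq(Frag a, Frag b) {   // a b
            a.out.next = b.start;           // epsilon-link the parts
            return new Frag(a.start, b.out);
        }

        static Frag alt(Frag a, Frag b) {   // a | b (unordered here)
            State s = new State(), e = new State();
            s.next = a.start;  s.alt = b.start;
            a.out.next = e;    b.out.next = e;
            return new Frag(s, e);
        }

        static Frag star(Frag a) {          // a*
            State s = new State(), e = new State();
            s.next = a.start;               // enter the loop...
            s.alt = e;                      // ...or skip it
            a.out.next = s;                 // and loop back
            return new Frag(s, e);
        }
    }

Inlining a regular rule would then just mean splicing its fragment
into the caller's machine rather than going through a call stack.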

For ordered choice, all branches would be explored simultaneously.
The list of current states would thus not be just a set, but a set
with dependencies between its items - if the first branch of a choice
matches, the remaining branches would be discarded. Similarly for
predicates: the predicate could be explored simultaneously with the
following production, with the production dependent on the outcome of
the predicate. When a particular state is 'split' at a decision
point, dependencies would also have to be maintained between the
resulting states, so the live states form a tree. I'd use a stack for
recursive rules, i.e. "call" a particular state and then "return" to
the previous one when done.
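
In code, I imagine the dependency tracking looking something like
this - each live NFA "thread" remembers which branch it took at each
ordered choice (0 for the first alternative, 1 for the second, and so
on) plus its own return stack, and the shared choice-path prefixes
are exactly the tree structure described above. All names here are
mine, nothing exists yet:

    import java.util.*;

    class ChoiceThread {
        // Branch indices taken at each ordered choice, outermost
        // first; shared prefixes form the dependency tree.
        final List<Integer> choicePath = new ArrayList<>();

        // "Call"/"return" stack of NFA state ids for recursive rules.
        final Deque<Integer> returnStack = new ArrayDeque<>();

        // Split this thread at an ordered choice, taking one branch.
        ChoiceThread fork(int index) {
            ChoiceThread child = new ChoiceThread();
            child.choicePath.addAll(choicePath);
            child.choicePath.add(index);
            child.returnStack.addAll(returnStack);
            return child;
        }

        // Once 'winner' has matched the choice it entered at depth
        // 'depth', every thread that took a later branch of that
        // same choice is discarded.
        static void pruneLosers(List<ChoiceThread> live,
                                ChoiceThread winner, int depth) {
            List<Integer> prefix = winner.choicePath.subList(0, depth);
            int taken = winner.choicePath.get(depth);
            live.removeIf(t -> t != winner
                    && t.choicePath.size() > depth
                    && t.choicePath.subList(0, depth).equals(prefix)
                    && t.choicePath.get(depth) > taken);
        }
    }

A predicate would hang off the tree the same way: the thread for the
following production is a dependent of the predicate's thread, and is
kept or killed according to the predicate's outcome (inverted for a
not-predicate).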

So parsing would then simply be a single forward scan over the input,
trading memory (and implementation complexity) for backtracking. One
nice property is that memory is only consumed "on demand" - only when
it is actually needed to disambiguate. Also, at any point you can
easily identify which rules are active, and they are already
prioritised. Follow sets for ANTLR-style recovery can be trivially
calculated on demand. In fact, if a rule fails it could continue to
be explored in recovery mode, with a decision made later about
whether to keep the recovered match or discard it in favour of a
better one: a recovery of a higher-priority rule, an actual match of
a lower-priority rule, or a recovery that required "less" work by
some metric.
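
The end-of-scan decision then boils down to a comparator over the
surviving threads. A sketch, assuming each survivor carries its
choice path plus an accumulated recovery cost (0 for a clean match) -
how the two are weighted against each other is pure policy, so the
comparator below (clean matches first, then PEG priority) is just one
possible choice:

    import java.util.*;

    class RecoverySelect {
        // A surviving thread at end of input: the ordered-choice
        // path it took and the total "work" its recovery steps cost.
        record Candidate(List<Integer> choicePath, int recoveryCost) {}

        // Prefer clean matches, then earlier choice branches;
        // swapping the two stages would prefer higher-priority rules
        // even when they had to be recovered.
        static Candidate pick(List<Candidate> survivors) {
            return survivors.stream()
                .min(Comparator
                        .comparingInt(Candidate::recoveryCost)
                        .thenComparing(Candidate::choicePath,
                                       RecoverySelect::comparePaths))
                .orElse(null);
        }

        // Lexicographic: an earlier branch index = higher priority.
        static int comparePaths(List<Integer> a, List<Integer> b) {
            int n = Math.min(a.size(), b.size());
            for (int i = 0; i < n; i++) {
                int c = Integer.compare(a.get(i), b.get(i));
                if (c != 0) return c;
            }
            return Integer.compare(a.size(), b.size());
        }
    }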

Does this sound like a reasonable strategy? Has anyone tried something
similar to this? Any pitfalls I haven't thought of?

Thanks for any feedback,
Colin

