Hi all,

I've been thinking over the last couple of days about a way of implementing a PEG parser for a project I'm working on, and I wanted to ask whether people think it's a crazy idea.

I'm looking into using PEGs to develop parsers for a Java IDE, and since I have to integrate with the IDE's existing infrastructure I have some unusual restrictions. I have to use a classical lexer, so I'd be using PEGs to parse tokens rather than characters, which actually seems to remove a few of the standard problems with PEGs (although it probably complicates grammar composition). Since this will be used in an interactive editor, error recovery is very important - in fact, the IDE developers recommend hand-written parsers for exactly this reason.
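Concretely (and with entirely made-up names), by "parsing tokens" I just mean that a PEG terminal would test the kind of the next token coming out of the existing lexer rather than a character, something like:

    // Made-up names, just to pin down what "PEG over tokens" means here:
    // the terminal expression tests the kind of the next lexer token,
    // not a character.
    enum TokenKind { IDENT, NUMBER, LPAREN, RPAREN, EOF }

    record Token(TokenKind kind, String text) {}

    interface Expr {
        // Return the position after a successful match, or -1 on failure.
        int match(java.util.List<Token> input, int pos);
    }

    final class Term implements Expr {
        private final TokenKind kind;
        Term(TokenKind kind) { this.kind = kind; }
        public int match(java.util.List<Token> input, int pos) {
            return pos < input.size() && input.get(pos).kind() == kind ? pos + 1 : -1;
        }
    }

Returning the next position (or -1) is just the simplest way to write it down; it isn't meant to prejudge the NFA approach below.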
So my idea would be to implement the parser using a Thompson-style NFA instead of the usual recursive descent. This seems like a reasonable choice, since for many "normal" grammars large parts of the grammar are regular (there is a paper by Krzemien and Lukasiewicz describing an algorithm for identifying the regular parts, but I can't find it online). Regular rules could be inlined so that the remaining productions are the ones that genuinely need more complex machinery (recursion or ordered choice). This is similar to Ragel, where a large regular grammar is compiled into a single large DFA with actions on transitions.

For ordered choice, all branches would be explored simultaneously. The collection of current states would therefore not be a plain set, but a set with dependencies between its items: if the first branch of a choice matches, the remaining ones are discarded. Similarly for predicates: the predicate could be explored simultaneously with the production that follows it, with the production dependent on the predicate. When a particular state is 'split' at a decision point, dependencies would also have to be maintained between the resulting states, so the dependencies form a tree. (There's a very rough sketch of this bookkeeping at the end of this mail.) For recursive rules I'd use a stack, i.e. "call" a particular state and then "return" to the previous one when done.

Parsing would then simply be a single forward scan over the input, spending memory (and implementation complexity) to avoid backtracking. One nice property is that memory is only consumed "on demand": it is only used when it's needed to disambiguate. Also, at any point you can easily identify which rules are active, and they are already prioritised. Follow sets for ANTLR-style recovery can be computed trivially on demand, and in fact, if a rule fails, it could continue to be explored in recovery mode, with the decision made later about whether to keep the recovered match or discard it in favour of something better: a recovery of a higher-priority rule, an actual match of a lower-priority one, or a recovery that required "less" work by some metric.

Does this sound like a reasonable strategy? Has anyone tried something similar? Any pitfalls I haven't thought of?

Thanks for any feedback,
Colin
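P.S. In case the description of the prioritised state set is too abstract, here is a very rough Java sketch of just the bookkeeping I have in mind. There is no real NFA, grammar or recovery in it and all the names are made up; it only shows how each live "thread" could remember its position in a tree of choice points, so that once a higher-priority alternative commits, everything hanging off the lower-priority alternatives gets pruned.

    import java.util.*;

    public class PrioritisedStates {

        // One ordered-choice expansion; remembers the best alternative matched so far.
        static final class Choice {
            int committedPriority = Integer.MAX_VALUE;
        }

        // "I am running inside alternative `priority` of `choice`"; branches nest
        // through `parent`, giving the dependency tree described above.
        static final class Branch {
            final Branch parent;
            final Choice choice;
            final int priority;
            Branch(Branch parent, Choice choice, int priority) {
                this.parent = parent; this.choice = choice; this.priority = priority;
            }
        }

        // A live NFA "thread": current state, a call stack for recursive rules,
        // and its position in the choice tree.
        static final class Thread_ {
            final String state;               // stand-in for an NFA state id
            final Deque<String> callStack;    // "call"/"return" for recursive rules
            final Branch branch;
            Thread_(String state, Deque<String> callStack, Branch branch) {
                this.state = state; this.callStack = callStack; this.branch = branch;
            }
        }

        // A thread is dead if any enclosing choice has already been won by a
        // higher-priority (lower-numbered) alternative than the one it sits in.
        static boolean dead(Branch b) {
            for (; b != null; b = b.parent)
                if (b.priority > b.choice.committedPriority) return true;
            return false;
        }

        // Drop dead threads, preserving the priority order of the survivors.
        static List<Thread_> prune(List<Thread_> threads) {
            List<Thread_> live = new ArrayList<>();
            for (Thread_ t : threads)
                if (!dead(t.branch)) live.add(t);
            return live;
        }

        public static void main(String[] args) {
            Choice c = new Choice();
            Branch first  = new Branch(null, c, 0);   // first alternative of the choice
            Branch second = new Branch(null, c, 1);   // second alternative
            List<Thread_> threads = new ArrayList<>(List.of(
                new Thread_("s1", new ArrayDeque<>(), first),
                new Thread_("s7", new ArrayDeque<>(), second)));

            c.committedPriority = 0;                   // first alternative just matched
            System.out.println(prune(threads).size()); // -> 1: the second-branch thread is gone
        }
    }

The real thing would of course need the NFA transition function, the predicate dependencies and the recovery bookkeeping on top of this; the pruning is just the part I was trying to describe above.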