Hi, guys. I thought here would be a good place to talk about my parser generator project, Waxeye, since the original inspiration came from reading Bryan's papers.
My goal with the project is to create a tool that makes the syntax aspect of language development as easy as possible, regardless of what programming language is being used. I'm not there yet, but I've got to the stage where it might be of interest to others.

* Grammar syntax

I was able to get by without needing to add much to the syntax of the original PEG paper. I made a few small changes to make grammars easier for me to read.

The first was to make the '*', '+' and '?' operators prefix, like the '&' and '!' operators. I found that my mind would "backtrack" when reading grammars written in postfix.

I also changed the '/' to '|', but that could just be personal preference.

Finally, I created a shorthand for case-insensitive strings by changing the " character to mean case-insensitive. e.g. You can say

  "integer"

instead of saying

  [iI][nN][tT][eE][gG][eE][rR]

* Modular, composable grammars

The grammar composition is inspired by PLT Scheme's module system. All the module work is done in an external file instead of editing the grammars directly. e.g. This would give Ruby a new regular expression syntax.

  (all-except "ruby.waxeye" Regexp)
  (rename "new-regexp.waxeye" (Start . Regexp))

* End of input checking

By default, a Waxeye parser checks that all input has been consumed after parsing has finished. It's not a big deal, but it's helpful when composing grammars if the grammar you're embedding doesn't have '!.' at the end of its starting non-terminal.

* Grammar testing

Once I wrote the command-line interpreter, it turned out to be pretty easy to add a grammar tester similar to gUnit.

* Automatic AST generation

Each non-terminal rule corresponds to a node type in the AST. I considered various ways to specify what should be included in the trees, but I realized the main thing I wanted to do was just remove one or two expressions. So, the way I get rid of horizontal noise in the AST is to prefix an expression with ':'. e.g. Here, the '=' character is useless information since, when processing the tree, it is already implicit that a '=' character was read.

  Assignment <- Variable :'=' Expression

When a non-terminal is never needed, it can be specified with '<:' instead of '<-'.

  Ws <: *[ \t\n\r]

Of course, if someone needed the parse tree instead of the AST, I could have an option to disable these voiding expressions.

The situation of vertical noise seems largely a non-issue to me since it's not hard to prune those trees. If it turns out to be really needed, it would be easy to have options to specify that, if a given rule has only one child result, that child should take its place. That could be handled either within the parser or as an automatically generated tree post-processor. e.g. When only one Num is matched, have that be returned from Prod.

  Prod <- Num *([*/] Ws Num)

* Memoization

I haven't experimented much with different ways to do memoization. So far, I've just been hashing the results of non-terminals against their id number and the input position. This bounds things at O(n^2 log n) or, if you make the hash table big enough that it never needs to grow, O(n^2).

I recently moved from a method/procedure-based design to a finite-automata-based design. Because of that, one option would be to memoize the results of the non-deterministic states to bring it down to O(n log n) or O(n). After I've finished more important features, I'll experiment with different memoization schemes.
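To make that scheme concrete, here is a minimal Java sketch of a memo table keyed on non-terminal id and input position. This is only an illustration of the idea described above, not Waxeye's actual runtime code; the MemoTable and Entry names are invented for the example.

  import java.util.HashMap;
  import java.util.Map;

  // Illustrative memo table keyed on (non-terminal id, input position).
  final class MemoTable {
      // A cached outcome: the AST that was built (null on failure) and
      // the input position the parser finished at, so a cache hit can
      // restore it without re-parsing.
      static final class Entry {
          final Object ast;
          final int endPos;
          Entry(Object ast, int endPos) { this.ast = ast; this.endPos = endPos; }
      }

      private final Map<Long, Entry> table = new HashMap<Long, Entry>();

      // Pack the two ints into one long to act as the hash key.
      private static long key(int nonterminalId, int pos) {
          return ((long) nonterminalId << 32) | (pos & 0xffffffffL);
      }

      Entry get(int nonterminalId, int pos) {
          return table.get(key(nonterminalId, pos));
      }

      void put(int nonterminalId, int pos, Object ast, int endPos) {
          table.put(key(nonterminalId, pos), new Entry(ast, endPos));
      }
  }

A parser would consult the table before trying a non-terminal at a given position and record the outcome (including failures) afterwards; making the table big enough that it never needs to grow is what removes the log n factor, as noted above.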
* Runtime targets

I've currently got runtimes for Java, Ruby and Scheme. I'll probably add a C runtime soon since I have some half-done code, and they don't take long to write.

* License

Waxeye and the runtimes are under the MIT/X11 license.

Some of the things on my TODO list:

* Left recursion

I haven't bothered to put left recursion in yet, since I rarely write my grammars that way. Having said that, I do think it's important, so I'll be putting it in soon.

* Error reporting and recovery

Again, I haven't spent much time on this yet, but I know it's important for making the parsers and the tool itself more usable.

* Context-sensitive parsing

The way I'm planning to support this is to have labels for expressions and to pass the value of the labeled expression to an external bit of code that the user writes. These bits of code do any needed work in the host programming language and then return the potentially modified context. This context is used as part of the memoization key and is recorded in the memoized result. A context could be something as simple as an integer for Python indentation or something more complicated like a symbol table for parsing C.

e.g. Python

  Tab <- a='\t' @indent<a>

e.g. C

  Id <- a=([a-zA-Z_] *[a-zA-Z0-9_]) @addSymbol<a>

* Regression tests

I need to start building a suite of tests for the generator and runtime targets to pass.

* A couple of bugs

I need to rewrite some of my automata generation code because it doesn't like a few grammars at the moment. Silly things like these:

  **'a'
  &&'a'
  !!'a'

Anyway, download it and try it out. As a starter, you can play around with the interpreter and Json grammar.

  echo -n "{\"data\" : [-1, [{}, null], 3.14, true]}" | ./build/waxeye -i grammars/json.waxeye

http://waxeye.org
http://sourceforge.net/projects/waxeye

Orlando Hill

p.s. I've set up a mailing list on SourceForge, so you can give feedback there if you want.
https://lists.sourceforge.net/lists/listinfo/waxeye-users