Hi, guys. I thought here would be a good place to talk about my parser generator project, Waxeye, since the original inspiration came from reading Bryan's papers.
My goal with the project is to create a tool that makes the syntax aspect of language development as easy as possible, regardless of what programming language is being used. I'm not there yet, but I've got to the stage where it might be of interest to others.

* Grammar syntax

I was able to get by without needing to add much to the syntax of the original PEG paper. I made a few small changes to make grammars easier for me to read.

The first was to make the '*', '+' and '?' operators prefix, like the '&' and '!' operators. I found that my mind would "backtrack" when reading grammars written in postfix.

I also changed the '/' to '|', but that could just be personal preference.

Finally, I created a shorthand for case-insensitive strings by changing the " character to mean case-insensitive. e.g. You can say

  "integer"

instead of saying

  [iI][nN][tT][eE][gG][eE][rR]

* Modular, composable grammars

The grammar composition is inspired by PLT Scheme's module system. All the module work is done in an external file instead of editing the grammars directly. e.g. This would give Ruby a new regular expression syntax.

  (all-except "ruby.waxeye" Regexp)
  (rename "new-regexp.waxeye" (Start . Regexp))

* End of input checking

By default, a Waxeye parser checks that all input has been consumed after parsing has finished. It's not a big deal, but it's helpful when composing grammars if the grammar you're embedding doesn't have '!.' at the end of its starting non-terminal.

* Grammar testing

Once I wrote the command-line interpreter, it turned out to be pretty easy to add a grammar tester similar to gUnit.

* Automatic AST generation

Each non-terminal rule corresponds to a node type in the AST. I considered various ways to specify what should be included in the trees, but I realized the main thing I wanted to do was just remove one or two expressions. So, the way I get rid of horizontal noise in the AST is to prefix an expression with ':'. e.g. Here, the '=' character is useless information since, when processing the tree, it is already implicit that a '=' character was read.

  Assignment <- Variable :'=' Expression

When a non-terminal is never needed, it can be specified with '<:' instead of '<-'.

  Ws <: *[ \t\n\r]

Of course, if someone needed the parse tree instead of the AST, I could have an option to disable these voiding expressions.

The situation of vertical noise seems largely a non-issue to me since it's not hard to prune those trees. If it turns out to be really needed, it would be easy to have options to specify that, if a given rule has only one child result, that child should take its place. That could be handled either within the parser or as an automatically generated tree post-processor. e.g. When only one Num is matched, have that be returned from Prod.

  Prod <- Num *([*/] Ws Num)

* Memoization

I haven't experimented much with different ways to do memoization. So far, I've just been hashing the results of non-terminals against their id number and the input position. This bounds things at O(n^2 log n) or, if you make the hash table big enough that it never needs to grow, O(n^2).

I recently moved from a method/procedure-based design to a finite-automata-based design. Because of that, one option would be to memoize the results of the non-deterministic states to bring it down to O(n log n) or O(n). After I've finished more important features, I'll experiment with different memoization schemes.
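To make that scheme concrete, here is a minimal Java sketch of a memo table keyed on non-terminal id and input position. This is only an illustration of the idea described above, not Waxeye's actual runtime code; the MemoTable and Entry names are invented for the example.

  import java.util.HashMap;
  import java.util.Map;

  // Illustrative memo table keyed on (non-terminal id, input position).
  final class MemoTable {
      // A cached outcome: the AST that was built (null on failure) and
      // the input position the parser finished at, so a cache hit can
      // restore it without re-parsing.
      static final class Entry {
          final Object ast;
          final int endPos;
          Entry(Object ast, int endPos) { this.ast = ast; this.endPos = endPos; }
      }

      private final Map<Long, Entry> table = new HashMap<Long, Entry>();

      // Pack the two ints into one long to act as the hash key.
      private static long key(int nonterminalId, int pos) {
          return ((long) nonterminalId << 32) | (pos & 0xffffffffL);
      }

      Entry get(int nonterminalId, int pos) {
          return table.get(key(nonterminalId, pos));
      }

      void put(int nonterminalId, int pos, Object ast, int endPos) {
          table.put(key(nonterminalId, pos), new Entry(ast, endPos));
      }
  }

A parser would consult the table before trying a non-terminal at a given position and record the outcome (including failures) afterwards; making the table big enough that it never needs to grow is what removes the log n factor, as noted above.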
* Runtime targets

I've currently got runtimes for Java, Ruby and Scheme. I'll probably add a C runtime soon since I have some half-done code, and they don't take long to write.

* License

Waxeye and the runtimes are under the MIT/X11 license.

Some of the things on my TODO list:

* Left recursion

I haven't bothered to put left recursion in yet, since I rarely write my grammars that way. Having said that, I do think it's important, so I'll be putting it in soon.

* Error reporting and recovery

Again, I haven't spent much time on this yet, but I know it's important for making the parsers and the tool itself more usable.

* Context-sensitive parsing

The way I'm planning to support this is to have labels for expressions and to pass the value of the labeled expression to an external bit of code that the user writes. These bits of code do any needed work in the host programming language and then return the potentially modified context. This context is used as part of the memoization key and is recorded in the memoized result. A context could be something as simple as an integer for Python indentation or something more complicated like a symbol table for parsing C.

e.g. Python

  Tab <- a='\t' @indent<a>

e.g. C

  Id <- a=([a-zA-Z_] *[a-zA-Z0-9_]) @addSymbol<a>

* Regression tests

I need to start building a suite of tests for the generator and runtime targets to pass.

* A couple of bugs

I need to rewrite some of my automata generation code because it doesn't like a few grammars at the moment. Silly things like these:

  **'a'
  &&'a'
  !!'a'

Anyway, download it and try it out. As a starter, you can play around with the interpreter and Json grammar.

  echo -n "{\"data\" : [-1, [{}, null], 3.14, true]}" | ./build/waxeye -i grammars/json.waxeye

http://waxeye.org
http://sourceforge.net/projects/waxeye

Orlando Hill

p.s. I've set up a mailing list on SourceForge, so you can give feedback there if you want.
https://lists.sourceforge.net/lists/listinfo/waxeye-users