[ANN] pex, a powerful PEG parsing library

Ghadi Shayban Mon, 16 Nov 2015 22:06:49 -0800

Here is a compliant JSON parser in 99 LOC, implemented with *pex, a new 
parsing library*. [1]


This is alpha software.

Hot on the heels of Colin Fleming's Conj talk on PEGs [2], I'm making 
public an early version of pex [3], a parsing library in the PEG family. 
For those of you familiar with Lua's LPEG library, this is similar, but 
with a Clojure-y twist.  Like LPEG, pex uses a virtual machine to parse 
rules (as opposed to combinators or packrat parsing).  Unlike the approach 
in Colin's talk, pex operates on traditional character types, not generic 
sequences/data structures.

Why? Parsing Expression Grammars are simpler than most other grammar types, 
but more powerful than regular expressions. They do not introduce ambiguity 
-- you get one valid parse or none.  There exists a nice space *in between 
regexes (yuck) and instaparse (power but ambiguity)*.

Here is a tiny *example* grammar to match floating point numbers:

(def Number '{number  [digits (? fractional) (? exponent)]
              fractional ["." digits]
              exponent   ["e" (? (/ "+" "-")) digits]
              digits     [(class num) (* (class num))]})

The only other input this particular grammar needs is to let pex know what 
a `num` character class is.  (There is an interface that you can implement 
to match things, and several helpers.  I'm planning to have several common 
ones out of the box.)  Well, you also need to tell the grammar compiler 
what rule to start with (number).
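
To make the semantics of that grammar concrete, here is a rough sketch of a naive recursive matcher over the same data structure. It is not pex's API and not how pex works internally (pex compiles to bytecode); `char-classes`, `match-expr`, and `matches?` are hypothetical names, and the `num` class is assumed to mean "decimal digit".

```clojure
;; Illustrative only: a naive recursive matcher, NOT pex's API or its VM.
(def number-grammar
  '{number     [digits (? fractional) (? exponent)]
    fractional ["." digits]
    exponent   ["e" (? (/ "+" "-")) digits]
    digits     [(class num) (* (class num))]})

;; Hypothetical character-class table (assumption: num = decimal digit).
(def char-classes
  {'num #(Character/isDigit (char %))})

(declare match-expr)

(defn- match-seq
  "Match each expression in order; return nil as soon as one fails."
  [grammar exprs input pos]
  (reduce (fn [p e]
            (if-let [p' (match-expr grammar e input p)]
              p'
              (reduced nil)))
          pos
          exprs))

(defn match-expr
  "Return the position after matching expr at pos, or nil on failure."
  [grammar expr ^String input pos]
  (cond
    ;; string literal
    (string? expr) (when (.startsWith input ^String expr (int pos))
                     (+ pos (count expr)))
    ;; reference to another rule
    (symbol? expr) (match-expr grammar (get grammar expr) input pos)
    ;; sequence
    (vector? expr) (match-seq grammar expr input pos)
    ;; operators
    (seq? expr)
    (let [[op & args] expr]
      (case op
        ;; optional: fall back to the current position on failure
        ?     (or (match-seq grammar args input pos) pos)
        ;; zero or more, greedy; stop if no progress is made
        *     (loop [p pos]
                (let [p' (match-seq grammar args input p)]
                  (if (and p' (> p' p)) (recur p') p)))
        ;; ordered choice: the first alternative that matches wins
        /     (some #(match-expr grammar % input pos) args)
        ;; single character from a named class
        class (when (and (< pos (count input))
                         ((char-classes (first args))
                          (.charAt input (int pos))))
                (inc pos))))))

(defn matches?
  "True when the rule `start` consumes the whole input."
  [grammar start input]
  (= (count input) (match-expr grammar start input 0)))

(matches? number-grammar 'number "3.14e-10") ; => true
(matches? number-grammar 'number "1.")       ; => false
```

Note how `/` commits to the first alternative that succeeds, which is where the "one valid parse or none" property comes from.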

The grammar format has user defined *macros* which let you hide a lot of 
boilerplate, or make higher order rules.  For example, it's very common to 
chew whitespace after rules, so hiding that is useful.  There are also 
*captures* and *actions* that operate on a virtual "Value Stack".  For 
example, while parsing a JSON array, you push all the captured values from 
the array onto the stack, then reduce them into a vector with an action.
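
The value-stack idea can be modelled roughly like this, treating the stack as a plain Clojure vector. `push-capture` and `reduce-since` are hypothetical names for illustration, not pex functions, and the real stack lives inside the VM.

```clojure
;; Illustrative only: models the value stack as a plain vector.

(defn push-capture
  "Push one captured value onto the stack."
  [stack v]
  (conj stack v))

(defn reduce-since
  "Pop everything above index `mark`, apply f to those values
  (as a vector), and push the single result."
  [stack mark f]
  (conj (subvec stack 0 mark)
        (f (subvec stack mark))))

;; Parsing the JSON array [1, 2, 3] while an earlier value is on the stack:
(def result
  (let [stack ["outer"]                            ; captured before the array
        mark  (count stack)                        ; stack height at "["
        stack (reduce push-capture stack [1 2 3])] ; one capture per element
    (reduce-since stack mark vec)))                ; action fired at "]"

result ; => ["outer" [1 2 3]]
```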

It's very early, but pex's completely unoptimized engine can parse a 1.5MB 
file in ~58ms vs ~9ms for Cheshire/Jackson, which is a handwritten 
highly-tuned parser with many thousands of lines of code behind it.  I plan 
on closing that gap by a) implementing some of LPEG's compiler 
optimizations and b) improving some of the terribly naive implementations 
in the parser. The win here is *high expressive power per unit of 
performance*, not raw performance.

Internally, the grammar data structure is analyzed, compiled into special 
parsing bytecode, and then subsequently run inside a virtual machine [4].
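
For readers unfamiliar with the approach, here is a toy parsing machine in the LPEG style. The instruction names (`:char`, `:choice`, `:commit`, `:end`) follow LPEG's parsing machine; they are an assumption for illustration and are not pex's actual bytecode, which is implemented in Java [4].

```clojure
;; Illustrative only: a toy LPEG-style parsing machine, not pex's bytecode.

(defn run-vm
  "Run program against input; return the end position on success,
  nil on failure."
  [program ^String input]
  (loop [pc 0, pos 0, backtrack []]
    (let [[op arg] (nth program pc)]
      (case op
        ;; match one character or fail
        :char   (if (and (< pos (count input))
                         (= arg (.charAt input (int pos))))
                  (recur (inc pc) (inc pos) backtrack)
                  ;; failure: resume at the most recent choice point, if any
                  (when-let [[alt-pc alt-pos] (peek backtrack)]
                    (recur (long alt-pc) (long alt-pos) (pop backtrack))))
        ;; remember where to resume if the first alternative fails
        :choice (recur (inc pc) pos (conj backtrack [(+ pc (long arg)) pos]))
        ;; first alternative succeeded: discard the choice point and jump
        :commit (recur (+ pc (long arg)) pos (pop backtrack))
        :end    pos))))

;; Hand-compiled form of the sequence [(/ "+" "-") "1"]:
(def sign-then-one
  [[:choice 3]   ; 0: on failure, try the alternative at 3
   [:char \+]    ; 1
   [:commit 2]   ; 2: drop the choice point, jump to 4
   [:char \-]    ; 3
   [:char \1]    ; 4
   [:end]])      ; 5

(run-vm sign-then-one "-1") ; => 2
(run-vm sign-then-one "+2") ; => nil
```

Because `:commit` discards the choice point, a failure after `"+"` does not go back and retry `"-"`; that is the ordered-choice behavior of PEGs expressed as machine state.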

Hope you can find this useful in your data munging endeavors.  Next up is 
to make CSV & EDN example parsers, tune the performance, make grammar 
debugging better, and write more docs & tests.  I encourage any feedback.

[1] https://github.com/ghadishayban/pex/blob/master/src/com/champbacon/pex/examples/json.clj#L7-L39
[2] https://www.youtube.com/watch?v=kt4haSH2xcs
[3] https://github.com/ghadishayban/pex
[4] https://github.com/ghadishayban/pex/blob/master/src-java/com/champbacon/pex/impl/PEGByteCodeVM.java#L247-L280
