I've had many requests to port some of the advances in my
infovore framework to Jena and now I'm getting around to that.
My program Infovore at github
https://github.com/paulhoule/infovore
has a module called "parallel super eyeball" which, like the
eyeball program, checks an RDF file for trouble, but does not crash
when it finds it. One simplifying trick was to accept only N-Triples
and close variants, such as the Freebase export files. This means I
can reliably break a triple into nodes by splitting on the first two
runs of whitespace, then parse the nodes separately.
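The splitting trick can be sketched roughly like this (a minimal illustration, not the actual Infovore code; the class and method names are made up, and real N-Triples handling needs more care with comments, blank lines, and escapes):

```java
// Hypothetical sketch: split an N-Triples line into its three node strings
// by breaking on the first two runs of whitespace. The object node may
// itself contain whitespace (e.g. inside a quoted literal), which is why
// only the first two splits are safe.
public class NTripleSplitter {

    // Returns {subject, predicate, object} for a line like
    //   <s> <p> "a literal with spaces" .
    public static String[] splitLine(String line) {
        int i = skipNonWhitespace(line, 0);   // end of subject
        int j = skipWhitespace(line, i);      // start of predicate
        int k = skipNonWhitespace(line, j);   // end of predicate
        int m = skipWhitespace(line, k);      // start of object
        // The object runs to the terminating " ." -- strip that off.
        String object = line.substring(m).trim();
        if (object.endsWith("."))
            object = object.substring(0, object.length() - 1).trim();
        return new String[] { line.substring(0, i), line.substring(j, k), object };
    }

    private static int skipNonWhitespace(String s, int pos) {
        while (pos < s.length() && !Character.isWhitespace(s.charAt(pos))) pos++;
        return pos;
    }

    private static int skipWhitespace(String s, int pos) {
        while (pos < s.length() && Character.isWhitespace(s.charAt(pos))) pos++;
        return pos;
    }
}
```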
I hacked away at the triple parser in Jena to produce something
that parses a single node, and I did it in a surgical way, so there is
a pretty good chance it is correct. The result is here:
https://github.com/paulhoule/infovore/tree/master/
millipede/src/main/java/com/ontology2/rdf/parser
The real trouble with it is that it is terribly slow, so slow
that I was about to give up on it before introducing a parse cache,
which is the function createNodeParseCache() in
https://github.com/paulhoule/infovore/blob/master/
millipede/src/main/java/com/ontology2/rdf/JenaUtil.java
This sped it up to the point where I lost the motivation to
optimize it further, but that work should still happen. I'm sure the
parser is doing a lot of set-up work, some of which is superfluous,
and I'm also certain that a handwritten parser could beat the
generated one. Given how many billions of triples are out there, a
handwritten node parser may be worth the effort.
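For anyone curious about the cache idea, here is a minimal sketch of the shape of it (this is not the actual createNodeParseCache() code; the class name, the capacity, and the Function-based parser hook are all illustrative assumptions): identical node strings hit the cache instead of re-running the parser, and since Freebase-style dumps repeat the same nodes constantly, the hit rate is high.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch of a parse cache: memoize the parser over recently
// seen input strings, with simple LRU eviction. Not Infovore's actual API.
public class ParseCache<K, V> {
    private final Map<K, V> cache;
    private final Function<K, V> parser;
    public int hits = 0, misses = 0;  // counters for measuring effectiveness

    public ParseCache(final int capacity, Function<K, V> parser) {
        this.parser = parser;
        // Access-ordered LinkedHashMap plus removeEldestEntry gives an
        // easy LRU policy without pulling in an external cache library.
        this.cache = new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                return size() > capacity;
            }
        };
    }

    public V parse(K key) {
        V value = cache.get(key);
        if (value != null) { hits++; return value; }
        misses++;
        value = parser.apply(key);  // only pay the parse cost on a miss
        cache.put(key, value);
        return value;
    }
}
```

In the real thing the values would be Jena Node objects; I've kept it generic here so it stands alone.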
----
On another note, I couldn't help but notice that it's easy to
fill up memory with identical Node objects as seen in the following
test:
https://github.com/paulhoule/infovore/blob/master/
millipede/src/test/java/com/ontology2/rdf/UnderstandNodeMemoryBehavior.java
Given that many graphs repeat the same node values a lot, I wrote
some Economizer classes, tested in there, that keep a cache of
recently created Node and Triple objects. Perhaps I was being a bit
silly to expect to sort very large arrays of triples in memory, but I
found that "Economization" greatly reduced the memory required.
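The core of the idea is just interning: map each newly built object to a canonical copy so that equal values share one instance. A minimal sketch (the class and method names here are illustrative, not the actual Economizer API):

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the "Economizer" idea: an interning cache.
// Repeated equal values collapse to a single canonical instance, so a
// graph that mentions the same node a million times stores it once.
public class Economizer<T> {
    private final Map<T, T> seen = new HashMap<>();

    // Return a canonical instance equal to x, storing x the first time.
    public T economize(T x) {
        T canonical = seen.get(x);
        if (canonical == null) {
            seen.put(x, x);
            canonical = x;
        }
        return canonical;
    }
}
```

In practice you'd bound the map (or use weak references) so the cache itself doesn't become the memory problem.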
Any thoughts?