I've had many requests to port some of the advances in my
infovore framework to Jena and now I'm getting around to that.
My program Infovore at github
https://github.com/paulhoule/infovore
has a module called "parallel super eyeball" which, like the
eyeball program, checks an RDF file for trouble, but does not crash
when it finds it. One simplifying trick was to accept only N-Triples
and close variants, such as the Freebase export files. This means I
can reliably break a triple into nodes by splitting on the first two
runs of whitespace, then parse the nodes separately.
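The splitting trick can be sketched roughly like this (a minimal illustration, not the actual Infovore code; the class and method names are made up, and real N-Triples handling needs more care with comments, blank lines, and escapes):

```java
// Hypothetical sketch: split an N-Triples line into its three node strings
// by breaking on the first two runs of whitespace. The object node may
// itself contain whitespace (e.g. inside a quoted literal), which is why
// only the first two splits are safe.
public class NTripleSplitter {

    // Returns {subject, predicate, object} for a line like
    //   <s> <p> "a literal with spaces" .
    public static String[] splitLine(String line) {
        int i = skipNonWhitespace(line, 0);   // end of subject
        int j = skipWhitespace(line, i);      // start of predicate
        int k = skipNonWhitespace(line, j);   // end of predicate
        int m = skipWhitespace(line, k);      // start of object
        // The object runs to the terminating " ." -- strip that off.
        String object = line.substring(m).trim();
        if (object.endsWith("."))
            object = object.substring(0, object.length() - 1).trim();
        return new String[] { line.substring(0, i), line.substring(j, k), object };
    }

    private static int skipNonWhitespace(String s, int pos) {
        while (pos < s.length() && !Character.isWhitespace(s.charAt(pos))) pos++;
        return pos;
    }

    private static int skipWhitespace(String s, int pos) {
        while (pos < s.length() && Character.isWhitespace(s.charAt(pos))) pos++;
        return pos;
    }
}
```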
I hacked away at the triple parser in Jena to produce something
that parses a single node, and I did it in a surgical way, so there is
a pretty good chance it is correct. The result is here:
https://github.com/paulhoule/infovore/tree/master/
millipede/src/main/java/com/ontology2/rdf/parser
The real trouble with it is that it is terribly slow, so slow
that I was about to give up on it before introducing a parse cache,
which is the function createNodeParseCache() in
https://github.com/paulhoule/infovore/blob/master/
millipede/src/main/java/com/ontology2/rdf/JenaUtil.java
This sped it up to the point where I lost the motivation to
optimize it further, but that work should still happen. I'm sure the
parser is doing a lot of set-up work, some of which is superfluous,
and I'm also certain that a handwritten parser could beat the
generated one. Given how many billions of triples are out there, a
handwritten node parser may be worth the effort.
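For anyone curious about the cache idea, here is a minimal sketch of the shape of it (this is not the actual createNodeParseCache() code; the class name, the capacity, and the Function-based parser hook are all illustrative assumptions): identical node strings hit the cache instead of re-running the parser, and since Freebase-style dumps repeat the same nodes constantly, the hit rate is high.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch of a parse cache: memoize the parser over recently
// seen input strings, with simple LRU eviction. Not Infovore's actual API.
public class ParseCache<K, V> {
    private final Map<K, V> cache;
    private final Function<K, V> parser;
    public int hits = 0, misses = 0;  // counters for measuring effectiveness

    public ParseCache(final int capacity, Function<K, V> parser) {
        this.parser = parser;
        // Access-ordered LinkedHashMap plus removeEldestEntry gives an
        // easy LRU policy without pulling in an external cache library.
        this.cache = new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                return size() > capacity;
            }
        };
    }

    public V parse(K key) {
        V value = cache.get(key);
        if (value != null) { hits++; return value; }
        misses++;
        value = parser.apply(key);  // only pay the parse cost on a miss
        cache.put(key, value);
        return value;
    }
}
```

In the real thing the values would be Jena Node objects; I've kept it generic here so it stands alone.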
----
On another note, I couldn't help but notice that it's easy to
fill up memory with identical Node objects as seen in the following
test:
https://github.com/paulhoule/infovore/blob/master/
millipede/src/test/java/com/ontology2/rdf/UnderstandNodeMemoryBehavior.java
Given that many graphs repeat the same node values a lot, I wrote
some Economizer classes, tested in there, that keep a cache of
recently created Node and Triple objects. Perhaps I was being a bit
silly to expect to sort very large arrays of triples in memory, but I
found that "Economization" greatly reduced the memory required.
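The core of the idea is just interning: map each newly built object to a canonical copy so that equal values share one instance. A minimal sketch (the class and method names here are illustrative, not the actual Economizer API):

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the "Economizer" idea: an interning cache.
// Repeated equal values collapse to a single canonical instance, so a
// graph that mentions the same node a million times stores it once.
public class Economizer<T> {
    private final Map<T, T> seen = new HashMap<>();

    // Return a canonical instance equal to x, storing x the first time.
    public T economize(T x) {
        T canonical = seen.get(x);
        if (canonical == null) {
            seen.put(x, x);
            canonical = x;
        }
        return canonical;
    }
}
```

In practice you'd bound the map (or use weak references) so the cache itself doesn't become the memory problem.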
Any thoughts?