This is not a reply to your main email topic, hence the changed subject line.

SSE.parseNode() parses a single node from a string, optionally using a
PrefixMapping to expand prefixed names.  It accepts Turtle/SPARQL-like
syntax and is pretty fast at what it does.
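
For example, a minimal sketch (package names are those of current Jena
releases - older 2.x versions use com.hp.hpl.jena.* instead - and the IRIs
are made up):

    import org.apache.jena.graph.Node;
    import org.apache.jena.shared.PrefixMapping;
    import org.apache.jena.sparql.sse.SSE;

    public class ParseNodeExample {
        public static void main(String[] args) {
            // IRIs and literals in Turtle/SPARQL-like syntax parse directly
            Node iri     = SSE.parseNode("<http://example.org/thing>");
            Node literal = SSE.parseNode("'hello'@en");
            // Supplying a PrefixMapping lets prefixed names be expanded
            PrefixMapping pmap = PrefixMapping.Factory.create();
            pmap.setNsPrefix("ex", "http://example.org/");
            Node viaPrefix = SSE.parseNode("ex:thing", pmap);
            System.out.println(iri + " | " + literal + " | " + viaPrefix);
        }
    }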

Often Jena already does what you need; sometimes you just need to ask
whether the Swiss Army knife already has the desired attachment!

On that note, also see SortedDataBag, which sorts very large collections of
triples without being bounded by available memory; it is how ARQ sorts very
large query results without running out of memory.

Rob



On 6/26/13 2:53 PM, "Paul Houle" <[email protected]> wrote:

>     I've had many requests to port some of the advances in my
>infovore framework to Jena and now I'm getting around to that.
>
>     My program Infovore at github
>
>https://github.com/paulhoule/infovore
>
>     has a module called "parallel super eyeball" which, like the eyeball
>program, checks an RDF file for trouble, but does not crash when it finds
>it.  One simplifying trick was to accept only N-Triples and close
>variants, such as the Freebase export files.  This means I can reliably
>break a triple into its three nodes by splitting on the first two runs of
>whitespace, then parse the nodes separately.
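
That splitting step could look roughly like the sketch below (a
hypothetical helper in plain Java, not the actual infovore code; it assumes
each line is a well-formed N-Triples statement ending in "."):

    // Break an N-Triples line on the first two runs of whitespace; the
    // subject and predicate cannot contain unescaped whitespace, so only
    // the object part may still hold spaces (inside a literal).
    static String[] splitNTriplesLine(String line) {
        String[] parts = line.trim().split("\\s+", 3);
        String object = parts[2].trim();
        if (object.endsWith("."))   // strip the statement terminator
            object = object.substring(0, object.length() - 1).trim();
        return new String[] { parts[0], parts[1], object };
    }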
>
>     I hacked away at the triple parser in Jena to produce something that
>parses a single node, and I did it in a surgical way, so there is a
>pretty good chance it is correct.  The result is here:
>
>https://github.com/paulhoule/infovore/tree/master/millipede/src/main/java/com/ontology2/rdf/parser
>
>     The real trouble with it is that it is terribly slow,  so slow
>that I was about to give up on it before introducing a parse cache,
>which is the function createNodeParseCache() in
>
>https://github.com/paulhoule/infovore/blob/master/millipede/src/main/java/com/ontology2/rdf/JenaUtil.java
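
A stripped-down sketch of that kind of parse cache, using a LinkedHashMap
as an LRU and SSE.parseNode() as the underlying parser (the class below is
for illustration only and is not how the linked createNodeParseCache() is
implemented):

    import java.util.LinkedHashMap;
    import java.util.Map;
    import org.apache.jena.graph.Node;
    import org.apache.jena.sparql.sse.SSE;

    // Memoizing node parser: identical node strings are parsed once and reused.
    class NodeParseCache {
        private static final int MAX_ENTRIES = 100000;
        private final Map<String, Node> cache =
            new LinkedHashMap<String, Node>(16, 0.75f, true) {
                @Override
                protected boolean removeEldestEntry(Map.Entry<String, Node> eldest) {
                    return size() > MAX_ENTRIES;   // evict least-recently-used entries
                }
            };

        public Node parse(String nodeString) {
            Node n = cache.get(nodeString);
            if (n == null) {
                n = SSE.parseNode(nodeString);     // the expensive parse happens only on a miss
                cache.put(nodeString, n);
            }
            return n;
        }
    }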
>
>     This sped it up to the point where I was no longer motivated to
>speed it up further, but that work should still happen.  I'm sure the
>parser is doing a lot of set-up work, some of which is superfluous, and
>I'm also certain that a handwritten parser could be faster than the
>generated one.  Seeing how many billions of triples there are out there,
>a handwritten node parser may be worth the effort.
>
>----
>
>    On another note,  I couldn't help but notice that it's easy to
>fill up memory with identical Node objects as seen in the following
>test:
>
>https://github.com/paulhoule/infovore/blob/master/millipede/src/test/java/com/ontology2/rdf/UnderstandNodeMemoryBehavior.java
>
>    Given that many graphs repeat the same node values a lot, I wrote
>some Economizer classes, tested in there, that cache recently created
>Node and Triple objects.  Perhaps I was being a bit silly to expect to
>sort very large arrays of triples in memory, but I found I was able to
>greatly reduce the memory required by using "Economization".
>
>Any thoughts?
