On 05/03/16 14:32, afs wrote:
GitHub user afs opened a pull request:
https://github.com/apache/jena/pull/127
JENA-1147 : Introduce FactoryRDF
This PR is a refactoring that better separates the concerns of
ParserProfile, which currently couples parsers to Node/triple creation etc.
In initial testing with caching, I'm seeing space savings of 30-50% when
parsing into in-memory datasets.
For example:
Data: chembl_20.0_unichem.ttl.gz
which is 12,643,734 triples
General/Standard Space=6632.41 MB
General/Caching Space=3513.50 MB Cache=77.9%
TIM/Standard Space=9382.10 MB
TIM/Caching Space=6261.84 MB Cache=77.9%
i.e. it saves about 3 GB of Node space.
General:
the usual DatasetGraph implementation based on holding Graph objects.
TIM:
the transactional in-memory DatasetGraph
/Standard:
With this PR, no caching
/Caching:
With this PR and using a 10,000 slot LRU cache on createURI
10,000 is an untuned random guess at the moment.
%-age - the cache hit rate.
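As a sanity check on the figures above, the percentage saving is just
(standard - caching) / standard; a quick calculation (not part of the PR,
just verifying the numbers quoted here):

```java
public class SpaceSaving {
    // Percentage of Node space saved by caching, given the two measurements.
    static double savingPct(double standardMB, double cachingMB) {
        return 100.0 * (standardMB - cachingMB) / standardMB;
    }

    public static void main(String[] args) {
        // General: 6632.41 MB -> 3513.50 MB, TIM: 9382.10 MB -> 6261.84 MB
        System.out.printf("General: %.1f%%%n", savingPct(6632.41, 3513.50));
        System.out.printf("TIM:     %.1f%%%n", savingPct(9382.10, 6261.84));
    }
}
```

That works out to roughly 47% for General and 33% for TIM, consistent with
the 30-50% range and the ~3 GB saved.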
(See FactoryRDFCaching in the PR as an illustration; it is not on the
parser path in the PR yet, as I want to test the settings a bit first.)
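For anyone unfamiliar with the technique, here is a minimal sketch of the
caching idea: an LRU map in front of URI node creation, so repeated URIs
share one object instead of being allocated afresh. The names below are
illustrative only, not Jena's actual FactoryRDF API.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative LRU-cached URI "factory" (not the real FactoryRDFCaching). */
class CachingUriFactory {
    private final int capacity;
    private long hits = 0, requests = 0;
    private final LinkedHashMap<String, String> cache;

    CachingUriFactory(int capacity) {
        this.capacity = capacity;
        // accessOrder = true: iteration order is least-recently-used first,
        // so removeEldestEntry evicts the LRU entry once we exceed capacity.
        this.cache = new LinkedHashMap<String, String>(capacity, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
                return size() > CachingUriFactory.this.capacity;
            }
        };
    }

    /** Return a shared instance for the URI, creating it only on a miss. */
    String createURI(String uri) {
        requests++;
        String cached = cache.get(uri);
        if (cached != null) { hits++; return cached; }
        // The real factory would build a Node here; the string stands in
        // for the shared Node object in this sketch.
        cache.put(uri, uri);
        return uri;
    }

    /** Fraction of createURI calls served from the cache. */
    double hitRate() { return requests == 0 ? 0.0 : (double) hits / requests; }
}
```

With a 10,000-slot cache and real-world data, the hit rate reported above
(77.9%) means nearly four out of five URI creations reuse an existing Node.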
Parsing speed is not adversely affected when parsing into in-memory
storage; in fact it's slightly faster with caching (lower GC costs?).
I'm still investigating the impact of caching on parsing alone. Parsing
to a sink, after triple creation, is currently measurably slower with
caching, and I want to investigate that further.
Using:
Chembl/chembl_20.0_assay.ttl.gz
49,952,240 triples : Parse/Cache 131.40s (Avg: 380,154 TPS)
49,952,240 triples : Parse/Standard 108.83s (Avg: 459,014 TPS)
Old style RIOT before this PR
111.67 sec 49,952,240 triples 447,320.14 TPS
(I'm seeing +/- ~5s variability in these timings)
Andy