On 05/03/16 14:32, afs wrote:
GitHub user afs opened a pull request:

     https://github.com/apache/jena/pull/127

     JENA-1147 : Introduce FactoryRDF


This PR is a refactoring that better separates the concerns of the ParserProfile, which currently couples the parsers to Node/Triple creation.

In initial testing with caching, I'm getting a space saving of 30-50% when parsing into in-memory datasets.

For example:

Data: chembl_20.0_unichem.ttl.gz (12,643,734 triples)

General/Standard      Space=6632.41 MB
General/Caching       Space=3513.50 MB  Cache=77.9%
TIM/Standard          Space=9382.10 MB
TIM/Caching           Space=6261.84 MB  Cache=77.9%

i.e. caching saves about 3 GB of Node space.
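For reference, the savings in the table work out as follows (quick arithmetic on the figures above, not code from the PR; the class name is just for illustration):

```java
public class CacheSavings {
    public static void main(String[] args) {
        // Space figures (MB) taken from the results above
        double general = 6632.41, generalCached = 3513.50;
        double tim = 9382.10, timCached = 6261.84;

        // General: ~47% saving, ~3.0 GB of Node space
        System.out.printf("General saving: %.1f%% (%.2f GB)%n",
                100 * (general - generalCached) / general,
                (general - generalCached) / 1024);
        // TIM: ~33% saving
        System.out.printf("TIM saving:     %.1f%%%n",
                100 * (tim - timCached) / tim);
    }
}
```

which matches the 30-50% range: roughly 47% for the General dataset and 33% for TIM.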

General:
  the usual DatasetGraph implementation based on holding Graph objects.
TIM:
  the transactional in-memory DatasetGraph
/Standard:
  With this PR, no caching
/Caching:
  With this PR and using a 10,000 slot LRU cache on createURI

10,000 is an untuned random guess at the moment.

The percentage is the cache hit rate.

(See FactoryRDFCaching in the PR as an illustration - it is not yet on the parser path in the PR because I want to test the settings a bit first.)
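The caching idea can be sketched roughly as follows. This is a minimal, self-contained illustration only, not the PR's actual FactoryRDFCaching code: the Node stand-in and the CachingFactory class are hypothetical, and only createURI echoes the real factory method named above.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Minimal stand-in for an RDF term; Jena's real Node type is more involved.
final class Node {
    final String uri;
    Node(String uri) { this.uri = uri; }
}

// Sketch of a factory that puts an LRU cache in front of createURI,
// so repeated URIs in the input share one Node object.
class CachingFactory {
    private final int capacity;
    private final Function<String, Node> base;
    private final Map<String, Node> cache;
    private long hits = 0, requests = 0;

    CachingFactory(int capacity, Function<String, Node> base) {
        this.capacity = capacity;
        this.base = base;
        // An access-ordered LinkedHashMap gives simple LRU eviction.
        this.cache = new LinkedHashMap<String, Node>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Node> eldest) {
                return size() > CachingFactory.this.capacity;
            }
        };
    }

    Node createURI(String uri) {
        requests++;
        Node n = cache.get(uri);
        if (n != null) { hits++; return n; }
        n = base.apply(uri);
        cache.put(uri, n);
        return n;
    }

    // Fraction of createURI calls served from the cache.
    double hitRate() { return requests == 0 ? 0 : (double) hits / requests; }
}
```

A 10,000-slot cache corresponds to the untuned guess above; tracking hits against total requests is how a hit-rate figure like 77.9% can be measured.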

Parsing speed is not adversely affected when parsing and storing in-memory - in fact, it's slightly faster with caching (lower GC costs?).

I'm still investigating the impact of caching on parsing alone: parsing to a sink that discards triples after creation is currently measurably slower with caching, and I want to understand why.


Using:
Chembl/chembl_20.0_assay.ttl.gz

49,952,240 triples : Parse/Cache    131.40s (Avg: 380,154 TPS)
49,952,240 triples : Parse/Standard 108.83s (Avg: 459,014 TPS)

Old-style RIOT, before this PR:
 111.67 sec  49,952,240 triples  447,320.14 TPS

(I'm seeing +/- ~5s variability in these timings)

        Andy
