On 05/03/16 14:32, afs wrote:
GitHub user afs opened a pull request:
https://github.com/apache/jena/pull/127
JENA-1147 : Introduce FactoryRDF
This PR is a refactoring that better separates the concerns of
ParserProfile, which currently couples parsers to Node/triple creation etc.
In initial testing with caching, I'm seeing space savings of 30-50% when
parsing into in-memory datasets.
For example:
Data: chembl_20.0_unichem.ttl.gz
which is 12,643,734 triples
General/Standard Space=6632.41 MB
General/Caching Space=3513.50 MB Cache=77.9%
TIM/Standard Space=9382.10 MB
TIM/Caching Space=6261.84 MB Cache=77.9%
i.e. it saves about 3 GB of Node space.
General:
the usual DatasetGraph implementation based on holding Graph objects.
TIM:
the transactional in-memory DatasetGraph
/Standard:
With this PR, no caching
/Caching:
With this PR and using a 10,000 slot LRU cache on createURI
10,000 is an untuned random guess at the moment.
%-age - the cache hit rate.
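As a sanity check on the figures above, the percentage saving is just
(standard - caching) / standard; a quick calculation (not part of the PR,
just verifying the numbers quoted here):

```java
public class SpaceSaving {
    // Percentage of Node space saved by caching, given the two measurements.
    static double savingPct(double standardMB, double cachingMB) {
        return 100.0 * (standardMB - cachingMB) / standardMB;
    }

    public static void main(String[] args) {
        // General: 6632.41 MB -> 3513.50 MB, TIM: 9382.10 MB -> 6261.84 MB
        System.out.printf("General: %.1f%%%n", savingPct(6632.41, 3513.50));
        System.out.printf("TIM:     %.1f%%%n", savingPct(9382.10, 6261.84));
    }
}
```

That works out to roughly 47% for General and 33% for TIM, consistent with
the 30-50% range and the ~3 GB saved.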
(See FactoryRDFCaching in the PR as an illustration; it is not on the
parser path in the PR yet, as I want to test the settings a bit first.)
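For anyone unfamiliar with the technique, here is a minimal sketch of the
caching idea: an LRU map in front of URI node creation, so repeated URIs
share one object instead of being allocated afresh. The names below are
illustrative only, not Jena's actual FactoryRDF API.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative LRU-cached URI "factory" (not the real FactoryRDFCaching). */
class CachingUriFactory {
    private final int capacity;
    private long hits = 0, requests = 0;
    private final LinkedHashMap<String, String> cache;

    CachingUriFactory(int capacity) {
        this.capacity = capacity;
        // accessOrder = true: iteration order is least-recently-used first,
        // so removeEldestEntry evicts the LRU entry once we exceed capacity.
        this.cache = new LinkedHashMap<String, String>(capacity, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
                return size() > CachingUriFactory.this.capacity;
            }
        };
    }

    /** Return a shared instance for the URI, creating it only on a miss. */
    String createURI(String uri) {
        requests++;
        String cached = cache.get(uri);
        if (cached != null) { hits++; return cached; }
        // The real factory would build a Node here; the string stands in
        // for the shared Node object in this sketch.
        cache.put(uri, uri);
        return uri;
    }

    /** Fraction of createURI calls served from the cache. */
    double hitRate() { return requests == 0 ? 0.0 : (double) hits / requests; }
}
```

With a 10,000-slot cache and real-world data, the hit rate reported above
(77.9%) means nearly four out of five URI creations reuse an existing Node.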
Parsing speed is not adversely affected when parsing into in-memory
storage; in fact it's slightly faster with caching (lower GC costs?).
I'm still investigating the impact of caching on parsing alone. Parsing
to a sink, after triple creation, is currently measurably slower with
caching, and I want to investigate that further.
Using:
Chembl/chembl_20.0_assay.ttl.gz
49,952,240 triples : Parse/Cache 131.40s (Avg: 380,154 TPS)
49,952,240 triples : Parse/Standard 108.83s (Avg: 459,014 TPS)
Old style RIOT before this PR
111.67 sec 49,952,240 triples 447,320.14 TPS
(I'm seeing +/- ~5s variability in these timings)
Andy