On 03/02/17 19:46, A. Soroka wrote:
We've had some discussion about providing caching at various layers of Jena on a few
occasions in the last year or two. I'd like to throw out (for discussion) a potential
limited-scope design that would use only "off-the-shelf" parts.
What if:
1) I write a simple DatasetChanges impl that just writes through to a TDB
dataset. (Let's call it ListeningTDB for short; there's a sketch after this
list.)
2) Then I write a variation on the current TIM dataset that adds the
DatasetChanges-"listener" functionality from DatasetGraphMonitor.
3) I write some Assemblers that let you build up a TIM dataset-with-listener
and a ListeningTDB for it, and let you load the TIM dataset, not from files or
URLs as usual, but directly from the TDB dataset.
4) Then we could line up (for Fuseki, for example) a TIM dataset that handles
reads in memory but writes changes through to TDB.
Obviously, this is only useful where the whole dataset can fit into memory,
and it could have an _awful_ startup time, but I suspect that it would serve
a good number of cases. It would certainly cover my most important case.
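The load-from-TDB step in (3) could be as crude as a full scan and copy,
which is exactly where that startup cost would come from. Untested, and
ignoring transaction scoping, with tdb and tim as in the sketch above:

    // Startup: replay the persisted TDB content into the TIM dataset.
    tdb.find().forEachRemaining(tim::add);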
The design sounds like a log-backed dataset (LBD). There is no need to
have a full persistent triple store here. An LBD is a general component.
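That is, the listener could simply append each change to a durable log and
replay it into the in-memory dataset at start-up, with no triple store
involved. A rough sketch against the same DatasetChanges interface (the log
line format here is invented):

    import java.io.PrintStream;
    import org.apache.jena.graph.Node;
    import org.apache.jena.sparql.core.DatasetChanges;
    import org.apache.jena.sparql.core.QuadAction;

    /** Sketch: record changes to a log instead of a second dataset. */
    class ChangeLog implements DatasetChanges {
        private final PrintStream log;
        ChangeLog(PrintStream log) { this.log = log; }

        @Override public void start() {}
        @Override public void finish() { log.flush(); }
        @Override public void reset() {}

        @Override
        public void change(QuadAction action, Node g, Node s, Node p, Node o) {
            // One change per line: the action, then the quad terms.
            log.printf("%s %s %s %s %s%n", action, g, s, p, o);
        }
    }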
Am I missing anything obvious here? I often am... :sigh:
What is your most important case?
Do you have figures for the relative speeds? If the dataset fits in memory,
TDB will quickly have it all in memory - and all of one index, if it did the
equivalent start-up. How much faster is TIM? TIM is slower than the general
in-memory dataset.

I'm not saying TDB is faster, but personally I'd like to know whether there
is a sufficient advantage to make the time spent worthwhile before starting.
TDB in-memory is much slower than TDB cached from disk.
"in-memory" means a crude and expensive copy-in/copy-out RAM disk -- TDB
in-memory is for testing only.
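Concretely, the testing-only form is the location-less dataset:

    // TDB with no location: backed by the copy-in/copy-out RAM "disk".
    DatasetGraph testOnly = TDBFactory.createDatasetGraph();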
TIM loads at about 100K TPS and the general dataset at about 190K TPS on
my machine. [1]
bsbm-5m.nt.gz:
TIM     Space=2371.75 MB  Time=48.81s  (Avg: 102,448 TPS)
General Space=1924.94 MB  Time=25.43s  (Avg: 196,680 TPS)
Both runs use Node interning, which slows loading a bit but makes the data
30%-50% more compact, reading from .nt.gz.
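For the shape of the measurement, without the full harness in [1], an
untested sketch (the file name is illustrative and timings will vary):

    import org.apache.jena.riot.RDFDataMgr;
    import org.apache.jena.sparql.core.DatasetGraph;
    import org.apache.jena.sparql.core.DatasetGraphFactory;
    import org.apache.jena.system.Txn;

    public class LoadTimer {
        public static void main(String[] args) {
            time("TIM",     DatasetGraphFactory.createTxnMem());
            time("General", DatasetGraphFactory.createGeneral());
        }

        static void time(String label, DatasetGraph dsg) {
            long start = System.nanoTime();
            // Load inside a write transaction; TIM is transactional-only.
            Txn.executeWrite(dsg, () -> RDFDataMgr.read(dsg, "bsbm-5m.nt.gz"));
            System.out.printf("%s: %.2fs%n", label,
                              (System.nanoTime() - start) / 1e9);
        }
    }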
And a parsing discussion at [2].
-----------------
Looking for general components:
1/ A DatasetGraph written for speed: MRSW for the transactions and the
fastest indexes for the data.
The general purpose dataset might already be close to this. It uses hash
indexing for the first two levels of access, so it is O(1) for those (a
sketch of that index shape follows this list).
How does TIM compare to the general dataset for read?
2/ An in-memory DatasetGraph designed to store large amounts of RDF - not
necessarily as compact as possible, so not a complete trade-off of speed for
space. RAM is cheap. Massive in-memory data is coming. Not necessarily
in-heap.
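For the index point in 1/ - not Jena's actual code, just the shape of the
idea - a two-level hash index over S then P, O(1) at each of the first two
levels:

    import java.util.Collections;
    import java.util.Map;
    import java.util.Set;
    import java.util.concurrent.ConcurrentHashMap;
    import org.apache.jena.graph.Node;
    import org.apache.jena.graph.Triple;

    /** Sketch of a two-level hash index: subject, then predicate. */
    class TwoLevelIndex {
        private final Map<Node, Map<Node, Set<Triple>>> spo =
            new ConcurrentHashMap<>();

        void add(Triple t) {
            spo.computeIfAbsent(t.getSubject(), s -> new ConcurrentHashMap<>())
               .computeIfAbsent(t.getPredicate(), p -> ConcurrentHashMap.newKeySet())
               .add(t);
        }

        /** All triples with this subject and predicate: two O(1) lookups. */
        Set<Triple> find(Node s, Node p) {
            return spo.getOrDefault(s, Collections.emptyMap())
                      .getOrDefault(p, Collections.emptySet());
        }
    }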
.. anyway ... some thoughts ...
Andy
[1]
Loading code from when TIM came into being:
https://github.com/afs/jena-workspace/blob/master/src/main/java/tools/RunMemTimeSpace.java
... not pretty code.

[2]
A Twitter conversation the other day that covered parse speeds:
https://twitter.com/AndySeaborne/status/826944192823820288
RDF-Thrift is much faster to parse.