On 03/02/17 19:46, A. Soroka wrote:
> We've had some discussion about providing caching at various layers of Jena on a few
> occasions in the last year or two. I'd like to throw out (for discussion) a potential
> limited-scope design that would use only "off-the-shelf" parts.
>
> What if:
>
> 1) I write a simple DatasetChanges impl that just writes through to a TDB
> dataset. (Let's call it ListeningTDB for short; see the sketch after this list.)
> 2) Then I write a variation on the current TIM dataset that adds the
> DatasetChanges-"listener" functionality from DatasetGraphMonitor.
> 3) I write some Assemblers that let you build up a TIM dataset-with-listener
> and a ListeningTDB for it, and let you load the TIM dataset, not from files or
> URLs as usual, but directly from the TDB dataset.
> 4) Then we could line up (for Fuseki, for example) a TIM dataset that handles
> reads in memory but writes changes through to TDB.
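
A minimal, untested sketch of steps 1, 2 and 4, using Jena's DatasetChanges and
DatasetGraphMonitor hooks (the TDB location "DB" is illustrative):

    import org.apache.jena.graph.Node;
    import org.apache.jena.sparql.core.*;
    import org.apache.jena.tdb.TDBFactory;

    /** Step 1: a DatasetChanges that writes every change through to TDB. */
    class ListeningTDB implements DatasetChanges {
        private final DatasetGraph tdb;
        ListeningTDB(DatasetGraph tdb) { this.tdb = tdb; }
        @Override public void start() {}
        @Override public void change(QuadAction action, Node g, Node s, Node p, Node o) {
            switch (action) {
                case ADD:    tdb.add(g, s, p, o);    break;
                case DELETE: tdb.delete(g, s, p, o); break;
                default:     break;  // NO_ADD / NO_DELETE: nothing actually changed
            }
        }
        @Override public void finish() {}
        @Override public void reset() {}
    }

    /** Steps 2 and 4: TIM answers reads; the monitor replays each write onto TDB. */
    public class TimOverTdb {
        public static void main(String[] args) {
            DatasetGraph tdb = TDBFactory.createDatasetGraph("DB");
            DatasetGraph tim = DatasetGraphFactory.createTxnMem();
            DatasetGraph ds  = new DatasetGraphMonitor(tim, new ListeningTDB(tdb));
            // Hand "ds" to Fuseki or application code.
        }
    }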

> Obviously, this is only useful for those cases where the whole dataset can fit
> into memory, and it could have an _awful_ startup time, but I suspect that it
> might work well for a fair number of cases. That would include my most important
> case, anyway.

The design sounds like a log-backed dataset (LBD). There is no need to have a full persistent triple store here. An LBD is a general component.
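
For instance, a minimal, untested sketch in the same DatasetChanges style
(ChangeLog and the line-per-change format are illustrative): append each change
to a log, and replay the log into a fresh in-memory dataset at start-up.

    import java.io.IOException;
    import java.io.Writer;
    import org.apache.jena.graph.Node;
    import org.apache.jena.sparql.core.DatasetChanges;
    import org.apache.jena.sparql.core.QuadAction;
    import org.apache.jena.sparql.util.FmtUtils;

    /** Append each change to a log; replaying the log rebuilds the dataset. */
    class ChangeLog implements DatasetChanges {
        private final Writer log;
        ChangeLog(Writer log) { this.log = log; }
        @Override public void start() {}
        @Override public void change(QuadAction action, Node g, Node s, Node p, Node o) {
            if (action != QuadAction.ADD && action != QuadAction.DELETE)
                return;  // skip the no-op actions
            try {
                log.write(action.name() + " "
                        + FmtUtils.stringForNode(g) + " " + FmtUtils.stringForNode(s) + " "
                        + FmtUtils.stringForNode(p) + " " + FmtUtils.stringForNode(o) + "\n");
            } catch (IOException e) { throw new RuntimeException(e); }
        }
        @Override public void finish() {
            try { log.flush(); } catch (IOException e) { throw new RuntimeException(e); }
        }
        @Override public void reset() {}
    }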


> Am I missing anything obvious here? I often am... :sigh:

What is your most important case?

Do you have figures for the relative speeds? If the dataset fits in memory, TDB will quickly have it all in memory anyway - and all of one index, if it did the equivalent start-up. How much faster is TIM? For loading, TIM is slower than the general in-memory dataset.

I'm not saying TDB is faster, but personally I'd like to know if there is a sufficient advantage to make the time spent worthwhile before it is started. (TDB in-memory is much slower than TDB cached from disk: "in-memory" there means a crude and expensive copy-in/copy-out RAM disk - TDB in-memory is for testing only.)

TIM loads at about 100K TPS and the general dataset at about 190K TPS on my machine. [1]

bsbm-5m.nt.gz:
  TIM       Space = 2371.75 MB   Time = 48.81s   (Avg: 102,448 TPS)
  General   Space = 1924.94 MB   Time = 25.43s   (Avg: 196,680 TPS)

Both runs read from .nt.gz with Node interning, which slows loading a bit but makes the in-memory data 30%-50% more compact.
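
For context, a rough sketch of that kind of measurement (not the code in [1];
the file name is illustrative, and depending on the Jena version loading TIM
may need wrapping in a write transaction):

    import org.apache.jena.riot.RDFDataMgr;
    import org.apache.jena.sparql.core.DatasetGraph;
    import org.apache.jena.sparql.core.DatasetGraphFactory;

    public class LoadTiming {
        public static void main(String[] args) {
            time("TIM",     DatasetGraphFactory.createTxnMem());
            time("General", DatasetGraphFactory.createGeneral());
        }
        static void time(String label, DatasetGraph dsg) {
            long start = System.nanoTime();
            RDFDataMgr.read(dsg, "bsbm-5m.nt.gz");  // file name illustrative
            System.out.printf("%-8s %.2fs%n", label, (System.nanoTime() - start) / 1e9);
        }
    }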

See [2] for a related discussion of parsing speeds.

-----------------


Looking for general components:

1/ A DatasetGraph written for speed: MRSW (multiple-reader, single-writer) for the transactions and the fastest indexes for the data.

The general purpose dataset might already be close to this. It uses hash indexing for the first two levels of access, so it is O(1) for those.
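
Concretely (URIs illustrative), the first two slots of a find() are resolved by hash lookup:

    import org.apache.jena.graph.Node;
    import org.apache.jena.graph.NodeFactory;
    import org.apache.jena.sparql.core.DatasetGraph;
    import org.apache.jena.sparql.core.DatasetGraphFactory;

    public class GeneralFind {
        public static void main(String[] args) {
            DatasetGraph dsg = DatasetGraphFactory.createGeneral();
            Node g = NodeFactory.createURI("http://example/g");
            Node s = NodeFactory.createURI("http://example/s");
            // Graph, then subject, are located by hashing - O(1) for those two levels.
            dsg.find(g, s, Node.ANY, Node.ANY).forEachRemaining(System.out::println);
        }
    }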

How does TIM compare to the general dataset for reads?


2/ An in-memory DatasetGraph designed to store large amounts of RDF - not necessarily as compact as possible, so not a complete trade-off of speed for space. RAM is cheap. Massive in-memory data is coming. Not necessarily on-heap.

.. anyway ... some thoughts ...

        Andy

[1]
Loading code from when TIM came into being:
https://github.com/afs/jena-workspace/blob/master/src/main/java/tools/RunMemTimeSpace.java
... not pretty code.

[2]
A Twitter conversation the other day that covered parse speeds:
https://twitter.com/AndySeaborne/status/826944192823820288

RDF-Thrift is much faster to parse.
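
A minimal, untested sketch of a round trip (file names illustrative):

    import java.io.FileOutputStream;
    import java.io.OutputStream;
    import org.apache.jena.riot.Lang;
    import org.apache.jena.riot.RDFDataMgr;
    import org.apache.jena.riot.RDFFormat;
    import org.apache.jena.sparql.core.DatasetGraph;
    import org.apache.jena.sparql.core.DatasetGraphFactory;

    public class ThriftRoundTrip {
        public static void main(String[] args) throws Exception {
            DatasetGraph dsg = DatasetGraphFactory.create();
            RDFDataMgr.read(dsg, "data.nt.gz");                   // text: tokenizing, escapes
            try (OutputStream out = new FileOutputStream("data.trdf")) {
                RDFDataMgr.write(out, dsg, RDFFormat.RDF_THRIFT); // binary encoding
            }
            DatasetGraph dsg2 = DatasetGraphFactory.create();
            RDFDataMgr.read(dsg2, "data.trdf", Lang.RDFTHRIFT);   // much faster to re-read
        }
    }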



