On 03/02/17 19:46, A. Soroka wrote:
We've had some discussion about providing caching at various layers of Jena on a few
occasions in the last year or two. I'd like to throw out (for discussion) a potential
limited-scope design that would use only "off-the-shelf" parts.
What if:
1) I write a simple DatasetChanges impl that just writes through to a TDB
dataset. (Let's call it ListeningTDB for short; there's a sketch after this
list.)
2) Then I write a variation on the current TIM dataset that adds the
DatasetChanges-"listener" functionality from DatasetGraphMonitor.
3) I write some Assemblers that let you build up a TIM dataset-with-listener
and a ListeningTDB for it, and let you load the TIM dataset, not from files or
URLs as usual, but directly from the TDB dataset.
4) Then we could line up (for Fuseki, for example) a TIM dataset that handles
reads in memory but writes changes through to TDB.
Obviously, this is only useful where the whole dataset can fit into memory,
and it could have an _awful_ startup time, but I suspect that it would serve
a good number of cases. It would certainly cover my most important case.
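The load-from-TDB step in (3) could be as crude as a full scan and copy,
which is exactly where that startup cost would come from. Untested, and
ignoring transaction scoping, with tdb and tim as in the sketch above:

    // Startup: replay the persisted TDB content into the TIM dataset.
    tdb.find().forEachRemaining(tim::add);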
The design sounds like a log-backed dataset (LBD). There is no need to
have a full persistent triple store here. An LBD is a general component.
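That is, the listener could simply append each change to a durable log and
replay it into the in-memory dataset at start-up, with no triple store
involved. A rough sketch against the same DatasetChanges interface (the log
line format here is invented):

    import java.io.PrintStream;
    import org.apache.jena.graph.Node;
    import org.apache.jena.sparql.core.DatasetChanges;
    import org.apache.jena.sparql.core.QuadAction;

    /** Sketch: record changes to a log instead of a second dataset. */
    class ChangeLog implements DatasetChanges {
        private final PrintStream log;
        ChangeLog(PrintStream log) { this.log = log; }

        @Override public void start() {}
        @Override public void finish() { log.flush(); }
        @Override public void reset() {}

        @Override
        public void change(QuadAction action, Node g, Node s, Node p, Node o) {
            // One change per line: the action, then the quad terms.
            log.printf("%s %s %s %s %s%n", action, g, s, p, o);
        }
    }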
Am I missing anything obvious here? I often am... :sigh:
What is your most important case?
Do you have figures for the relative speeds? If the dataset fits in memory,
TDB will quickly have it all in memory - and all of one index, if it did the
equivalent start-up. How much faster is TIM? TIM is slower than the general
in-memory dataset.

I'm not saying TDB is faster, but personally I'd like to know whether there
is a sufficient advantage to make the time spent worthwhile before starting.
TDB in-memory is much slower than TDB cached from disk.
"in-memory" means a crude and expensive copy-in/copy-out RAM disk -- TDB
in-memory is for testing only.
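Concretely, the testing-only form is the location-less dataset:

    // TDB with no location: backed by the copy-in/copy-out RAM "disk".
    DatasetGraph testOnly = TDBFactory.createDatasetGraph();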
TIM loads at about 100K TPS and the general dataset at about 190K TPS on
my machine. [1]
bsbm-5m.nt.gz:
TIM     Space=2371.75 MB  Time=48.81s  (Avg: 102,448 TPS)
General Space=1924.94 MB  Time=25.43s  (Avg: 196,680 TPS)
Both runs use Node interning, which slows loading a bit but makes the data
30%-50% more compact, reading from .nt.gz.
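For the shape of the measurement, without the full harness in [1], an
untested sketch (the file name is illustrative and timings will vary):

    import org.apache.jena.riot.RDFDataMgr;
    import org.apache.jena.sparql.core.DatasetGraph;
    import org.apache.jena.sparql.core.DatasetGraphFactory;
    import org.apache.jena.system.Txn;

    public class LoadTimer {
        public static void main(String[] args) {
            time("TIM",     DatasetGraphFactory.createTxnMem());
            time("General", DatasetGraphFactory.createGeneral());
        }

        static void time(String label, DatasetGraph dsg) {
            long start = System.nanoTime();
            // Load inside a write transaction; TIM is transactional-only.
            Txn.executeWrite(dsg, () -> RDFDataMgr.read(dsg, "bsbm-5m.nt.gz"));
            System.out.printf("%s: %.2fs%n", label,
                              (System.nanoTime() - start) / 1e9);
        }
    }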
And a parsing discussion at [2].
-----------------
Looking for general components:
1/ A DatasetGraph written for speed: MRSW for the transactions and the
fastest indexes for the data.
The general purpose dataset might already be close to this. It uses hash
indexing for the first two levels of access, so it is O(1) for those (a
sketch of that index shape follows this list).
How does TIM compare to the general dataset for read?
2/ An in-memory DatasetGraph designed to store large amounts of RDF - not
necessarily as compact as possible, so not a complete trade-off of speed for
space. RAM is cheap. Massive in-memory data is coming. Not necessarily
in-heap.
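For the index point in 1/ - not Jena's actual code, just the shape of the
idea - a two-level hash index over S then P, O(1) at each of the first two
levels:

    import java.util.Collections;
    import java.util.Map;
    import java.util.Set;
    import java.util.concurrent.ConcurrentHashMap;
    import org.apache.jena.graph.Node;
    import org.apache.jena.graph.Triple;

    /** Sketch of a two-level hash index: subject, then predicate. */
    class TwoLevelIndex {
        private final Map<Node, Map<Node, Set<Triple>>> spo =
            new ConcurrentHashMap<>();

        void add(Triple t) {
            spo.computeIfAbsent(t.getSubject(), s -> new ConcurrentHashMap<>())
               .computeIfAbsent(t.getPredicate(), p -> ConcurrentHashMap.newKeySet())
               .add(t);
        }

        /** All triples with this subject and predicate: two O(1) lookups. */
        Set<Triple> find(Node s, Node p) {
            return spo.getOrDefault(s, Collections.emptyMap())
                      .getOrDefault(p, Collections.emptySet());
        }
    }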
.. anyway ... some thoughts ...
Andy
[1]
Loading code from when TIM came into being:
https://github.com/afs/jena-workspace/blob/master/src/main/java/tools/RunMemTimeSpace.java
... not pretty code.

[2]
A Twitter conversation the other day that covered parse speeds:
https://twitter.com/AndySeaborne/status/826944192823820288
RDF-Thrift is much faster to parse.