>> because if the dataset fits in-memory TDB will quickly have it all in memory

Hm. This makes my whole idea pretty questionable. I didn't realize that TDB 
cached that aggressively. In my experience TIM and the general in-memory 
dataset are pretty similar on reads, with general being faster for a couple of 
common patterns and TIM being a bit faster on less common patterns. Not a huge 
difference. If we pulled out some of the less-often-used index patterns from 
TIM, I think it would land pretty close to general in both speed and size, but 
I don't know how worthwhile that work would be.
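
For concreteness, the kind of read comparison I mean is below: a rough sketch 
I haven't run. The two factory calls are the stock Jena ones; the query is a 
placeholder and the timing is crude.

import org.apache.jena.query.*;

public class TimVsGeneral {
    public static void main(String[] args) {
        // The two stock in-memory datasets being compared (Jena 3.x API):
        Dataset tim     = DatasetFactory.createTxnMem();   // TIM: transactional in-memory
        Dataset general = DatasetFactory.createGeneral();  // general-purpose in-memory

        // (loading the same data into both is elided)

        String q = "SELECT (COUNT(*) AS ?n) WHERE { ?s ?p ?o }";  // placeholder pattern
        time("TIM", tim, q);
        time("general", general, q);
    }

    static void time(String label, Dataset ds, String q) {
        ds.begin(ReadWrite.READ);
        try (QueryExecution qexec = QueryExecutionFactory.create(q, ds)) {
            long start = System.nanoTime();
            ResultSetFormatter.consume(qexec.execSelect());   // force full evaluation
            System.out.printf("%s: %.1f ms%n", label, (System.nanoTime() - start) / 1e6);
        } finally {
            ds.end();
        }
    }
}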

I haven't got any figures on loading speed that differ significantly from what 
you show. This whole idea was based on the (wrong) assumption that TDB leaves 
things mostly on disk. I know that there's off-heap memory mapping going on, 
but I don't pretend to understand it yet. :grin:
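
Before I shelve it, for the record: steps 1 and 2 would have looked roughly 
like the sketch below. Untested, against the existing DatasetChanges / 
DatasetGraphMonitor machinery; it hand-waves the hard part (coordinating 
transactions between the two stores), and the TDB location is made up.

import org.apache.jena.graph.Node;
import org.apache.jena.query.DatasetFactory;
import org.apache.jena.sparql.core.DatasetChanges;
import org.apache.jena.sparql.core.DatasetGraph;
import org.apache.jena.sparql.core.DatasetGraphMonitor;
import org.apache.jena.sparql.core.QuadAction;
import org.apache.jena.tdb.TDBFactory;

// Step 1: a DatasetChanges that just writes every change through to TDB.
public class ListeningTDB implements DatasetChanges {
    private final DatasetGraph tdb;

    public ListeningTDB(DatasetGraph tdb) { this.tdb = tdb; }

    @Override public void start()  {}
    @Override public void finish() {}
    @Override public void reset()  {}

    @Override
    public void change(QuadAction action, Node g, Node s, Node p, Node o) {
        switch (action) {
            case ADD:    tdb.add(g, s, p, o);    break;
            case DELETE: tdb.delete(g, s, p, o); break;
            default:     break;  // NO_ADD / NO_DELETE: nothing actually changed
        }
    }

    // Step 2: reads hit TIM; every write is echoed into TDB.
    public static DatasetGraph wrap(String tdbLocation) {  // location is illustrative
        DatasetGraph tim = DatasetFactory.createTxnMem().asDatasetGraph();
        DatasetGraph tdb = TDBFactory.createDatasetGraph(tdbLocation);
        return new DatasetGraphMonitor(tim, new ListeningTDB(tdb));
    }
}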

Well, back to the drawing board for me. As far as efficient in-memory 
representations go, I don't pretend to be deep in that area, but I know there 
are practical systems out there using bit matrices and other super-compact 
forms (a toy illustration follows). I don't know what the trade-off 
calculations are, though. And then there's distribution: if you can scale out, 
there's less pressure to put lots of triples in a single bank of RAM. Between 
those two moves, I know my principals would almost always choose scale-out.
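
By "bit matrices" I mean roughly this kind of thing: dictionary-encode the 
terms to ints, then keep, say, one subject-by-object bit matrix per predicate. 
A toy illustration only, saying nothing about the real engineering trade-offs:

import java.util.BitSet;

// Toy only: one subject x object adjacency matrix per predicate,
// over dictionary-encoded terms in [0, nTerms).
class PredicateMatrix {
    private final int nTerms;
    private final BitSet bits;

    PredicateMatrix(int nTerms) {
        this.nTerms = nTerms;
        this.bits = new BitSet(nTerms * nTerms);
    }

    void add(int s, int o)         { bits.set(s * nTerms + o); }
    boolean contains(int s, int o) { return bits.get(s * nTerms + o); }
}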

---
A. Soroka
The University of Virginia Library

> On Feb 3, 2017, at 5:00 PM, Andy Seaborne <[email protected]> wrote:
> 
> 
> 
> On 03/02/17 19:46, A. Soroka wrote:
>> We've had some discussion about providing caching at various layers of Jena 
>> on a few occasions in the last year or two. I'd like to throw out (for 
>> discussion) a potential limited-scope design that would use only 
>> "off-the-shelf" parts.
>> 
>> What if:
>> 
>> 1) I write a simple DatasetChanges impl that just writes through to a TDB 
>> dataset. (Let's call it ListeningTDB for short.)
>> 2) Then I write a variation on the current TIM dataset that adds the 
>> DatasetChanges-"listener" functionality from DatasetGraphMonitor.
>> 3) I write some Assemblers that let you build up a TIM dataset-with-listener 
>> and a ListeningTDB for it, and let you load the TIM dataset, not from files 
>> or URLs as usual, but directly from the TDB dataset.
>> 4) Then we could line up (for Fuseki, for example) a TIM dataset that 
>> handles reads in memory but writes changes through to TDB.
>> 
>> Obviously, this is only useful for those cases where the whole dataset can 
>> fit into memory, and it could have an _awful_ startup time, but I suspect 
>> that it would still work well for a good number of cases. That would 
>> include my most important case, anyway.
> 
> The design sounds like a log-backed dataset (LBD).  There is no need to have 
> a full persistent triple store here.  An LBD is a general component.
> 
>> 
>> Am I missing anything obvious here? I often am... :sigh:
> 
> What is your most important case?
> 
> Do you have figures for the relative speeds? Because if the dataset fits 
> in-memory TDB will quickly have it all in memory - and all of one index, if 
> it did the equivalent start-up.  How much faster is TIM? TIM is slower than 
> the general in-memory dataset.
> 
> I'm not saying TDB is faster, but personally I'd like to know if there is a 
> sufficient advantage to make the time spent worthwhile before it's started. 
> TDB in-memory is much slower than TDB cached from disk: "in-memory" means a 
> crude and expensive copy-in/copy-out RAM disk, so TDB in-memory is for 
> testing only.
> 
> TIM loads at about 100K TPS and the general dataset at about 190K TPS on my 
> machine. [1]
> 
> bsbm-5m.nt.gz
> TIM           Space=2371.75 MB  Time=48.81s  (Avg: 102,448 TPS)
> General       Space=1924.94 MB  Time=25.43s  (Avg: 196,680 TPS)
> 
> Both runs use Node interning, which slows loading a bit but makes the data 
> 30%-50% more compact, reading from .nt.gz.
> 
> And parsing discussion [2]
> 
> -----------------
> 
> 
> Looking for general components:
> 
> 1/ A DatasetGraph written for speed: MRSW (multiple reader, single writer) 
> for the transactions and the fastest indexes for the data.
> 
> The general-purpose dataset might already be close to this. It uses hash 
> indexing for the first two levels of access, so it is O(1) for those.
> 
> How does TIM compare to general for read?
> 
> 
> 2/ An in-memory DatasetGraph designed to store large amounts of RDF - not 
> necessarily as compact as possible, so not a complete trade-off of speed for 
> space.  RAM is cheap.  Massive in-memory data is coming. Not necessarily 
> in-heap.
> 
> .. anyway ... some thoughts ...
> 
>       Andy
> 
> [1]
> Loading code from when TIM came into being:
> https://github.com/afs/jena-workspace/blob/master/src/main/java/tools/RunMemTimeSpace.java
> 
> ... not pretty code.
> 
> [2]
> Twitter conversation the other day that covered parse speeds:
> https://twitter.com/AndySeaborne/status/826944192823820288
> 
> RDF-Thrift is much faster to parse.
> 
> 
>> 
>> 
>> ---
>> A. Soroka
>> The University of Virginia Library
>> 
