>> because if the dataset fits in-memory TDB will quickly have it all in memory
Hm. This makes my whole idea pretty questionable. I didn't realize that
TDB was so aggressive with caching.

From my experience, TIM and the general in-memory dataset are pretty
similar on reads, with general being faster for a couple of common
patterns and TIM being a bit faster on less usual patterns. Not a huge
difference. If we pulled out some of the less-often-used index patterns
from TIM, I think it would land pretty close to general for speed and
for size, but I don't know how worthwhile that work would be.

I haven't got any figures on loading speed that differ significantly
from what you show. This whole idea was based on the (wrong) assumption
that TDB leaves things mostly on disk. I know that there's the whole
off-heap memory-mapping business going on, but I don't pretend to
understand it yet. :grin: Well, back to the drawing board for me.

As far as efficient in-memory representations go, I don't pretend to be
deep in that area, but I know there are some practical systems out there
using bit matrices and other super-compact forms. I don't know what the
trade-off calculations are, though. And then there's distribution: if
you can scale out, the pressure to put lots of triples in a single bank
of RAM is smaller. Between those two moves, I know my principals would
almost always choose scale-out.
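For the archives, here is roughly the shape I had in mind for the
write-through part (step 1), before the caching news above made it
moot. This is only a sketch: WriteThroughToTDB is a name I'm inventing
here, the TDB location is a placeholder, and I'm hand-waving the
transaction coordination between the two stores entirely.

    import org.apache.jena.graph.Node;
    import org.apache.jena.sparql.core.DatasetChanges;
    import org.apache.jena.sparql.core.DatasetGraph;
    import org.apache.jena.sparql.core.DatasetGraphFactory;
    import org.apache.jena.sparql.core.DatasetGraphMonitor;
    import org.apache.jena.sparql.core.QuadAction;
    import org.apache.jena.tdb.TDBFactory;

    /** Sketch: replays changes to the in-memory dataset onto TDB. */
    class WriteThroughToTDB implements DatasetChanges {
        private final DatasetGraph tdb;
        WriteThroughToTDB(DatasetGraph tdb) { this.tdb = tdb; }

        @Override public void start() {}
        @Override public void change(QuadAction action, Node g, Node s, Node p, Node o) {
            switch (action) {
                case ADD:    tdb.add(g, s, p, o);    break;
                case DELETE: tdb.delete(g, s, p, o); break;
                default:     break; // NO_ADD / NO_DELETE: nothing actually changed
            }
        }
        @Override public void finish() {}
        @Override public void reset() {}
    }

Wiring it up would then look something like:

    DatasetGraph tdb = TDBFactory.createDatasetGraph("/some/location"); // placeholder
    DatasetGraph tim = DatasetGraphFactory.createTxnMem();
    DatasetGraph frontend = new DatasetGraphMonitor(tim, new WriteThroughToTDB(tdb));

The point of going through DatasetGraphMonitor is that TIM itself never
has to know about TDB; all the write-through lives in the listener.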
---
A. Soroka
The University of Virginia Library

> On Feb 3, 2017, at 5:00 PM, Andy Seaborne <[email protected]> wrote:
>
> On 03/02/17 19:46, A. Soroka wrote:
>> We've had some discussion about providing caching at various layers of
>> Jena on a few occasions in the last year or two. I'd like to throw out
>> (for discussion) a potential limited-scope design that would use only
>> "off-the-shelf" parts.
>>
>> What if:
>>
>> 1) I write a simple DatasetChanges impl that just wrote through to a
>> TDB dataset. (Let's call it ListeningTDB for short.)
>> 2) Then I write a variation on the current TIM dataset that adds the
>> DatasetChanges-"listener" functionality from DatasetGraphMonitor.
>> 3) I write some Assemblers that let you build up a TIM
>> dataset-with-listener and a ListeningTDB for it, and let you load the
>> TIM dataset, not from files or URLs as usual, but directly from the
>> TDB dataset.
>> 4) Then we could line up (for Fuseki, for example) a TIM dataset that
>> handles reads in memory but writes changes through to TDB.
>>
>> Obviously, this is only useful for those cases where the whole dataset
>> can fit into memory, and it could have an _awful_ startup time, but I
>> suspect that it might be good for a good number of cases. That would
>> include my most important case, anyway.
>
> The design sounds like a log-backed dataset (LBD). There is no need to
> have a full persistent triple store here. An LBD is a general component.
>
>> Am I missing anything obvious here? I often am... :sigh:
>
> What is your most important case?
>
> Do you have figures of the relative speeds? Because if the dataset fits
> in-memory, TDB will quickly have it all in memory - and all of one
> index, if it did the equivalent start-up. How much faster is TIM? TIM
> is slower than the general in-memory dataset.
>
> I'm not saying TDB is faster, but personally I'd like to know if there
> is a sufficient advantage to make the time spent worthwhile, but I
> started anyway: TDB in-memory is much slower than TDB cached from disk.
> "in-memory" means a crude and expensive copy-in/copy-out RAM disk - TDB
> in-memory is for testing only.
>
> TIM loads at about 100K TPS and the general dataset at about 190K TPS
> on my machine. [1]
>
> bsbm-5m.nt.gz
> TIM      Space=2371.75 MB  Time=48.81s  (Avg: 102,448 TPS)
> General  Space=1924.94 MB  Time=25.43s  (Avg: 196,680 TPS)
>
> with Node interning, which slows loading a bit but makes 30%-50% more
> compact, reading from .nt.gz.
>
> And parsing discussion: [2]
>
> -----------------
>
> Looking for general components:
>
> 1/ A DatasetGraph written for speed: MRSW for the transactions and the
> fastest indexes for data.
>
> The general-purpose dataset might already be close to this. It uses
> hash indexing down for the first two levels of access, so it is O(1)
> for that.
>
> How does TIM compare to general for read?
>
> 2/ An in-memory DatasetGraph designed to store large amounts of RDF -
> not necessarily as compact as possible, so not a complete trade-off of
> speed for space. RAM is cheap. Massive in-memory data is coming. Not
> necessarily in-heap.
>
> .. anyway ... some thoughts ...
>
>     Andy
>
> [2]
> Twitter conversation the other day that covered parse speeds:
> https://twitter.com/AndySeaborne/status/826944192823820288
> RDF-Thrift is much faster to parse.
>
> [1]
> Loading code from when TIM came into being:
> https://github.com/afs/jena-workspace/blob/master/src/main/java/tools/RunMemTimeSpace.java
> ... not pretty code.
>
>> ---
>> A. Soroka
>> The University of Virginia Library
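P.S. For completeness: the startup load in step 3 would have amounted
to no more than a bulk copy from TDB into TIM, which is exactly where
the awful startup time would come from. Another sketch, reusing the tdb
and tim names from above, ignoring error handling, and with imports
(Quad, Iterator, ReadWrite) omitted:

    // One-time startup load: copy every quad from TDB into TIM.
    tdb.begin(ReadWrite.READ);
    tim.begin(ReadWrite.WRITE);
    try {
        Iterator<Quad> quads = tdb.find(); // all quads, default graph included
        while (quads.hasNext())
            tim.add(quads.next());
        tim.commit();
    } finally {
        tim.end();
        tdb.end();
    }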

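P.P.S. On the point about hash indexing for the first two levels of
access: as I understand it, the general-purpose dataset gets its O(1)
behavior from what is, in essence, nested hash maps. A toy illustration
of the general idea only, not Jena's actual code:

    import java.util.*;
    import org.apache.jena.graph.Node;
    import org.apache.jena.graph.Triple;

    // Toy two-level index: graph node -> subject node -> triples.
    // Both levels are hash lookups, hence O(1) to reach the triples
    // for a given (graph, subject) prefix.
    class ToyIndex {
        private final Map<Node, Map<Node, Collection<Triple>>> index = new HashMap<>();

        Collection<Triple> find(Node g, Node s) {
            return index.getOrDefault(g, Collections.emptyMap())
                        .getOrDefault(s, Collections.emptyList());
        }

        void add(Node g, Triple t) {
            index.computeIfAbsent(g, k -> new HashMap<>())
                 .computeIfAbsent(t.getSubject(), k -> new ArrayList<>())
                 .add(t);
        }
    }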