Re: Timing tests for jena-624

A. Soroka Sat, 26 Sep 2015 06:52:00 -0700

Yes, there are lots of options, which is good. They certainly vary in their 
dependence on larger ecosystems, their level of support and development, and 
their basic designs. I started with PCollections because it is very lightweight 
and offered enough functionality to get the ideas down. I'm interested in 
Clojure’s implementations because they are well-known and well-tested, and the 
transient feature should offer some performance gain, but there are certainly other 
attractive candidates. Notably there are wrappers for Scala collections, and 
they (the wrapper projects, not Scala) seem to be much more active than any 
wrapper projects for Clojure. The use of Clojure data structures directly from 
Java seems to pull an awful lot after it.
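
To make the appeal of transients concrete, here is a minimal sketch in plain
Java. It is not the PCollections or Clojure API: CowSet, plus, plusAll and the
copies counter are hypothetical names invented for illustration. A real
persistent tree copies only the path to the root rather than the whole
structure; the full copy here just makes the number of new versions easy to
count.

```java
import java.util.HashSet;
import java.util.Set;

// Toy copy-on-write set. Hypothetical illustration only, not a real library API.
final class CowSet<T> {
    // Counts how many new versions were built, so the two styles can be compared.
    static int copies = 0;

    private final Set<T> items;

    private CowSet(Set<T> items) {
        this.items = items;
    }

    static <T> CowSet<T> empty() {
        return new CowSet<T>(new HashSet<T>());
    }

    // Persistent-style add: the receiver is left untouched and a new version
    // is built for every single element.
    CowSet<T> plus(T item) {
        Set<T> next = new HashSet<T>(items);
        copies++;
        next.add(item);
        return new CowSet<T>(next);
    }

    // Transient-style batch: copy once, mutate the private copy in place,
    // then hand it out as a new immutable version.
    CowSet<T> plusAll(Iterable<? extends T> batch) {
        Set<T> next = new HashSet<T>(items);
        copies++;
        for (T item : batch) {
            next.add(item);
        }
        return new CowSet<T>(next);
    }

    boolean contains(T item) {
        return items.contains(item);
    }

    int size() {
        return items.size();
    }
}
```

In this sketch, three chained plus() calls build three new versions, while one
plusAll() over the same three elements builds one. That is the saving a
transient-style batch would be expected to give inside a write transaction.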


DatasetGraphTriplesQuads seems like a pretty straightforward scheme. With TDB 
as the example, I can reuse the basic ideas I’ve already developed for the 
quad-side implementation, adding a 3-fold triples-only indexing layout for the 
triples side. That would save on every action against the default graph, which 
must be a lot of work in a lot of use cases. It would keep all the access at 
O(1) on both sides. I will work on this first, because it seems like a 
no-brainer no matter what the underlying data structure provider is.
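
A rough sketch of that layout follows, under loud assumptions. Nodes are
simplified to Strings, a null graph name stands for the default graph, and the
quad side is a plain set standing in for the six-index table. TripleIndexes
and TriplesQuadsStore are hypothetical names, not Jena classes, and this does
not use the real DatasetGraphTriplesQuads API.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Three covering indexes over triples: SPO, POS and OSP (hypothetical sketch).
final class TripleIndexes {
    // Each index maps first key -> second key -> set of third keys.
    private final Map<String, Map<String, Set<String>>> spo = new HashMap<>();
    private final Map<String, Map<String, Set<String>>> pos = new HashMap<>();
    private final Map<String, Map<String, Set<String>>> osp = new HashMap<>();

    private static void put(Map<String, Map<String, Set<String>>> index,
                            String a, String b, String c) {
        index.computeIfAbsent(a, k -> new HashMap<>())
             .computeIfAbsent(b, k -> new HashSet<>())
             .add(c);
    }

    private static Set<String> get(Map<String, Map<String, Set<String>>> index,
                                   String a, String b) {
        return index.getOrDefault(a, Collections.emptyMap())
                    .getOrDefault(b, Collections.emptySet());
    }

    // One add touches three maps, not the six a quad layout would need.
    void add(String s, String p, String o) {
        put(spo, s, p, o);
        put(pos, p, o, s);
        put(osp, o, s, p);
    }

    boolean contains(String s, String p, String o) {
        return get(spo, s, p).contains(o);
    }

    // Objects for a known subject and predicate, answered from SPO.
    Set<String> objects(String s, String p) {
        return get(spo, s, p);
    }

    // Subjects for a known predicate and object, answered from POS.
    Set<String> subjects(String p, String o) {
        return get(pos, p, o);
    }
}

// Routes default-graph triples and named-graph quads to separate storage.
final class TriplesQuadsStore {
    private final TripleIndexes defaultGraph = new TripleIndexes();
    // Stand-in for the quad table; a real one would hold six covering indexes.
    private final Set<List<String>> quads = new HashSet<>();

    // Assumption of this sketch: graph == null means the default graph.
    void add(String graph, String s, String p, String o) {
        if (graph == null) {
            defaultGraph.add(s, p, o);
        } else {
            quads.add(Arrays.asList(graph, s, p, o));
        }
    }

    boolean contains(String graph, String s, String p, String o) {
        return graph == null
                ? defaultGraph.contains(s, p, o)
                : quads.contains(Arrays.asList(graph, s, p, o));
    }
}
```

The point of the split is that work against the default graph never pays for
the quad-side indexes, and each pattern with two bound terms is answered by
two hash lookups in whichever index starts with those terms.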

---
A. Soroka
The University of Virginia Library

> On Sep 26, 2015, at 9:21 AM, Andy Seaborne <[email protected]> wrote:
> 
> On 26/09/15 12:07, A. Soroka wrote:
>> Ooh! Those numbers are awful.
> 
> Early days. The general purpose dataset has no features.   And, of course, a 
> concurrent read is completely blocked - that's a major issue for some usages.
> 
> Access performance, having update not block query, in a very reliable 
> implementation is a valuable thing to have. And if it is described as a 
> "complete temporal database", it is all a good thing.  Marketing.
> 
> The storage implementation is now a self-contained thing to look at. ... 
> seems there is no shortage of options ... google quickly got me:
> 
> http://stackoverflow.com/questions/8575723/whats-a-good-persistent-collections-framework-for-use-in-java
> 
> and there are more.  Various data structures I have not heard of before.
> 
>> Per your point 2, it does create a new
>> tree per add/remove. And PCollections’ bulk operations are just loops
>> over the single-element operations, so trying to accumulate data and
>> use a single operation will create the same number of trees.
>> Unfortunately, PCollections does not have something like Clojure’s
>> transient operations [*], where under carefully-controlled conditions
>> a normally persistent structure can be mutated in place for celerity
>> of operation. I have no commitment to PCollections, and I can switch
>> and see what happens with Clojure and transiency. But I should first
>> go back over the code with a fine-toothed comb and make sure that
>> there isn’t a plain old mistake of some kind.
>> 
>> As far as the indexes, I’m not quite sure what you mean by
>> “triples+quads”. Do you mean a single map from graph name to  three
>> triple-covering indexes? Something like Map<Node, TripleIndex>, with
>> TripleIndex having within it three covering indexes for triples in
>> the way that current HexIndex has within it six covering indexes for
>> quads?
> 
> That's one way - I meant using the supporting framework in 
> DatasetGraphTriplesQuads so
> 
> DatasetGraphQuads => DatasetGraphTriplesQuads
> 
> The default graph is handled separately from named graphs.
> 
> TDB uses this - there is a triple table (dft: 3 index) and a quads table 
> (dft: 6 index)
> 
>       Andy
> 
>> 
>> --- A. Soroka The University of Virginia Library
>> 
>> [*] http://clojure.org/transients
>> 
>>> On Sep 26, 2015, at 6:42 AM, Andy Seaborne <[email protected]>
>>> wrote:
>>> 
>>> Some thoughts:
>>> 
>>> 1/ If it were a triples+quads design (TripleTable, QuadTable) , not
>>> just quads, there would be 3 indexes not 6 for triples so 2x
>>> faster.
>>> 
>>> 2/ As autocommit and txn forms are nearly the same, I guess that
>>> every add(Quad) is causing a new pcollections tree for each index.
>>> 
>>> I don't know pcollections but is it possible to use it so an
>>> independent tree is created only at begin(W)? I.e. copy-to-root
>>> does not happen on stuff already touched after begin(W).
>>> 
>>> Andy
>> 
>
