Re: Timing tests for jena-624: even a little better

A. Soroka Sat, 26 Sep 2015 14:29:58 -0700

Sorry for spamming the list a bit today, but before COB I wanted to offer some 
more figures on this effort. Using a port of Scala’s immutable collections [*] 
in a new branch [**] the new implementation is now seeing a little better than 
half the load performance of the “stock” impl (see below sig). Of course these 
figures are very rough, but hopefully they demonstrate motion in the right 
direction. I still intend to try out Clojure’s collections, but I think I’m a 
lot closer to a realistic level of performance. I hope to demonstrate something 
about the query performance here soon.


[*] https://github.com/andrewoma/dexx

[**] https://github.com/ajs6f/jena/tree/jena-624-dexx

Anyone who is interested in examining these branches should be aware that they 
are currently moving targets, with commits landing several times a day.

---
A. Soroka
The University of Virginia Library



Running org.apache.jena.sparql.core.mem.PerfTest
==== Data: /Users/ajs6f/Documents/jena/bsbm-1m.nt.gz ====
    Size: 1,000,312 (2.978s, 335,900 tps)
==== DSG/mix/auto (warm N=3)
==== DSG/mix/txn  (warm N=3)
==== DSG/mem/auto (warm N=3)
==== DSG/mem/txn  (warm N=3)
==== DSG/mix/auto (N=20)
==== DSG/mix/auto (N=20) Time: 97.761s (204,644 tps)
==== DSG/mix/txn  (N=20)
==== DSG/mix/txn  (N=20) Time: 101.668s (196,780 tps)
==== DSG/mem/auto (N=20)
==== DSG/mem/auto (N=20) Time: 211.971s (94,381 tps)
==== DSG/mem/txn  (N=20)
==== DSG/mem/txn  (N=20) Time: 151.359s (132,177 tps)
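
To make the "a little better than half" comparison concrete, here is a minimal sketch that recomputes throughput and the new-to-stock ratio from the figures above. It assumes (inferred from the thread, not stated in the log) that DSG/mix is the stock implementation and DSG/mem is the new one, and that tps is N times the dataset size divided by the total time. The class and method names are hypothetical and are not part of PerfTest.

```java
public class LoadRatios {
    /** Triples per second for n repeated loads of `size` triples taking `seconds` in total. */
    public static double tps(long size, int n, double seconds) {
        return size * n / seconds;
    }

    /** Throughput of the new impl expressed as a fraction of the stock impl. */
    public static double ratio(double newTps, double stockTps) {
        return newTps / stockTps;
    }

    public static void main(String[] args) {
        long size = 1_000_312L; // dataset size reported in the log
        int n = 20;             // N=20 runs per configuration

        // Assumption: "mix" is the stock impl, "mem" is the new impl.
        double mixAuto = tps(size, n, 97.761);
        double mixTxn  = tps(size, n, 101.668);
        double memAuto = tps(size, n, 211.971);
        double memTxn  = tps(size, n, 151.359);

        System.out.printf("auto: new/stock = %.2f%n", ratio(memAuto, mixAuto));
        System.out.printf("txn:  new/stock = %.2f%n", ratio(memTxn, mixTxn));
    }
}
```

Under those assumptions the ratios come out at roughly 0.46 for autocommit and 0.67 for transactional loads, so "half" sits between the two modes.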

> On Sep 26, 2015, at 1:31 PM, A. Soroka <[email protected]> wrote:
> 
> I’ve committed the change to using separate triple and quad indexes (via 
> DatasetGraphTriplesQuads). There appears to be a definite and significant 
> improvement: Andy’s numbers showed the current implementation getting 5 
> times the load performance of the new implementation, while my numbers 
> (below) show the current impl at maybe 2.5 times the new impl’s 
> performance. Thanks for that advice, Andy! 
> 
> I’ll probably take a look next at moving to a more powerful library for 
> persistent structures that might either perform better raw or offer finer 
> control over tree creation as discussed above in this thread.
> 
> On a related note, are there any Jena standard parts for query testing for 
> this kind of situation? I know that BSBM has several sophisticated suites of 
> tests defined, but are any of them considered particularly appropriate, or 
> has anyone out there in dev-land built their own harness for BSBM or 
> something else that I could “borrow”? {grin}
> 
> — 
> A. Soroka
> The University of Virginia Library
> 
> === Data: /Users/ajs6f/Documents/jena/bsbm-1m.nt.gz ====
>    Size: 1,000,312 (2.947s, 339,434 tps)
> ==== DSG/mix/auto (warm N=3)
> ==== DSG/mix/txn  (warm N=3)
> ==== DSG/mem/auto (warm N=3)
> ==== DSG/mem/txn  (warm N=3)
> ==== DSG/mix/auto (N=20)
> ==== DSG/mix/auto (N=20) Time: 108.331s (184,676 tps)
> ==== DSG/mix/txn  (N=20)
> ==== DSG/mix/txn  (N=20) Time: 105.424s (189,769 tps)
> ==== DSG/mem/auto (N=20)
> ==== DSG/mem/auto (N=20) Time: 283.680s (70,523 tps)
> ==== DSG/mem/txn  (N=20)
> ==== DSG/mem/txn  (N=20) Time: 224.501s (89,114 tps)
> 
>> On Sep 26, 2015, at 9:21 AM, Andy Seaborne <[email protected]> wrote:
>> 
>> On 26/09/15 12:07, A. Soroka wrote:
>>> Ooh! Those numbers are awful.
>> 
>> Early days. The general purpose dataset has no features.   And, of course, a 
>> concurrent read is completely blocked - that's a major issue for some usages.
>> 
>> Access performance, having update not block query, in a very reliable 
>> implementation is a valuable thing to have. And if it is described as a 
>> "complete temporal database", it is all a good thing.  Marketing.
>> 
>> The storage implementation is now a self-contained thing to look at. ... 
>> seems there is no shortage of options ... google quickly got me:
>> 
>> http://stackoverflow.com/questions/8575723/whats-a-good-persistent-collections-framework-for-use-in-java
>> 
>> and there are more.  Various data structures I have not heard of before.
>> 
>>> Per your point 2, it does create a new
>>> tree per add/remove. And PCollections’ bulk operations are just loops
>>> over the single-element operations, so trying to accumulate data and
>>> use a single operation will create the same number of trees.
>>> Unfortunately, PCollections does not have something like Clojure’s
>>> transient operations [*], where under carefully-controlled conditions
>>> a normally persistent structure can be mutated in place for celerity
>>> of operation. I have no commitment to PCollections, and I can switch
>>> and see what happens with Clojure and transiency. But I should first
>>> go back over the code with a fine-toothed comb and make sure that
>>> there isn’t a plain old mistake of some kind.
>>> 
>>> As far as the indexes, I’m not quite sure what you mean by
>>> “triples+quads”. Do you mean a single map from graph name to three
>>> triple-covering indexes? Something like Map<Node, TripleIndex>, with
>>> TripleIndex having within it three covering indexes for triples in
>>> the way that current HexIndex has within it six covering indexes for
>>> quads?
>> 
>> That's one way - I meant using the supporting framework in 
>> DatasetGraphTriplesQuads so
>> 
>> DatasetGraphQuads => DatasetGraphTriplesQuads
>> 
>> The default graph is handled separately from named graphs.
>> 
>> TDB uses this - there is a triple table (dft: 3 index) and a quads table 
>> (dft: 6 index)
>> 
>>      Andy
>> 
>>> 
>>> ---
>>> A. Soroka
>>> The University of Virginia Library
>>> 
>>> [*] http://clojure.org/transients
>>> 
>>>> On Sep 26, 2015, at 6:42 AM, Andy Seaborne <[email protected]>
>>>> wrote:
>>>> 
>>>> Some thoughts:
>>>> 
>>>> 1/ If it were a triples+quads design (TripleTable, QuadTable) , not
>>>> just quads, there would be 3 indexes not 6 for triples so 2x
>>>> faster.
>>>> 
>>>> 2/ As autocommit and txn forms are nearly the same, I guess that
>>>> every add(Quad) is causing a new pcollections tree for each index.
>>>> 
>>>> I don't know pcollections, but is it possible to use it so that an
>>>> independent tree is created only at begin(W)? I.e., copy-to-root
>>>> does not happen again for parts already touched after begin(W).
>>>> 
>>>> Andy
>>> 
>> 
>
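
The triples+quads point in the thread (3 index writes for a default-graph triple instead of 6) can be sketched as below. This is an illustration only: SplitStoreSketch, its string "nodes", its set-based "indexes", and the particular index orderings are all hypothetical stand-ins, not Jena or TDB classes, and a null graph name is used here to mean the default graph.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class SplitStoreSketch {
    // Illustrative orderings: positions into (s,p,o) or (g,s,p,o).
    private static final int[][] TRIPLE_ORDERS = {{0, 1, 2}, {1, 2, 0}, {2, 0, 1}};
    private static final int[][] QUAD_ORDERS = {
        {0, 1, 2, 3}, {0, 2, 3, 1}, {0, 3, 1, 2},
        {1, 2, 3, 0}, {2, 3, 1, 0}, {3, 1, 2, 0}};

    // Each "index" is just a set of reordered tuples (hypothetical stand-in).
    private final List<Set<List<String>>> tripleIndexes = newIndexes(TRIPLE_ORDERS.length);
    private final List<Set<List<String>>> quadIndexes = newIndexes(QUAD_ORDERS.length);
    private long indexWrites = 0;

    private static List<Set<List<String>>> newIndexes(int n) {
        List<Set<List<String>>> out = new ArrayList<>();
        for (int i = 0; i < n; i++) out.add(new HashSet<>());
        return out;
    }

    /** Default-graph data (g == null) goes to the triple table, named graphs to the quad table. */
    public void add(String g, String s, String p, String o) {
        if (g == null) write(tripleIndexes, TRIPLE_ORDERS, s, p, o);
        else write(quadIndexes, QUAD_ORDERS, g, s, p, o);
    }

    private void write(List<Set<List<String>>> indexes, int[][] orders, String... tuple) {
        for (int i = 0; i < indexes.size(); i++) {
            List<String> key = new ArrayList<>();
            for (int pos : orders[i]) key.add(tuple[pos]);
            indexes.get(i).add(key);
            indexWrites++; // one write per covering index
        }
    }

    public long indexWrites() { return indexWrites; }

    public static void main(String[] args) {
        SplitStoreSketch store = new SplitStoreSketch();
        store.add(null, "s", "p", "o");     // default graph: 3 index writes
        store.add("g1", "s", "p", "o");     // named graph: 6 index writes
        System.out.println("index writes: " + store.indexWrites());
    }
}
```

A quads-only design would pay the 6 writes for both adds, which is where the "2x faster for triples" estimate in the thread comes from.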

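The "new tree per add/remove" behaviour discussed in the thread can be illustrated with a toy persistent structure. The sketch below is a plain unbalanced binary search tree using path copying; it is not how PCollections, dexx, or Clojure implement their collections, and every name in it is hypothetical. It only shows that each single-element insert allocates a fresh root plus a copy of the path to the changed node, while older versions stay valid and share untouched subtrees.

```java
public class PathCopySketch {
    /** Counts node allocations so the cost of each insert is visible. */
    public static long allocations = 0;

    public static final class Node {
        public final int key;
        public final Node left, right;
        Node(int key, Node left, Node right) {
            this.key = key;
            this.left = left;
            this.right = right;
            allocations++;
        }
    }

    /** Returns a new version of the tree; the old root is left untouched. */
    public static Node insert(Node root, int key) {
        if (root == null) return new Node(key, null, null);
        if (key < root.key) return new Node(root.key, insert(root.left, key), root.right);
        if (key > root.key) return new Node(root.key, root.left, insert(root.right, key));
        return root; // key already present: no copy needed
    }

    public static boolean contains(Node root, int key) {
        while (root != null) {
            if (key == root.key) return true;
            root = key < root.key ? root.left : root.right;
        }
        return false;
    }

    public static void main(String[] args) {
        Node v1 = insert(null, 5);
        Node v2 = insert(v1, 3);   // copies the root, adds one node
        Node v3 = insert(v2, 8);   // copies the root again, adds one node

        System.out.println("v2 contains 8: " + contains(v2, 8));
        System.out.println("v3 contains 8: " + contains(v3, 8));
        System.out.println("shared left subtree: " + (v3.left == v2.left));
        System.out.println("allocations: " + allocations);
    }
}
```

A bulk operation implemented as a loop over insert pays this copy for every element, which is the cost the thread hopes to avoid either with transient-style in-place mutation or by creating an independent tree only once per write transaction.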