On Sep 27, 2015, at 5:41 AM, Andy Seaborne <[email protected]> wrote:
> I can't try out your new stuff for a few days due to not being near a
> suitable computer.
No problem. On my machine, using Dexx (the Java port of the Scala collection
types), the branch improves to within half of the stock performance. I have now
tried some variations using the Clojure types (results after my sig) and didn’t
see much difference, so I’ll leave that question alone for the moment. I wasn’t
able to use Clojure’s transient (mutate-in-place-within-a-thread/transaction)
functionality, because Clojure transients do not afford iteration, which is
needed to support find(). It seems feasible to me that a custom implementation
with the ability to use mutate-in-place within transactions might offer more
improvement, but that’s a whole ‘nuther kettle of fish.
I’ll spend some time soon moving on with the Dexx branch and trying out some
simple tests of the kind you’ve outlined below (and I’ll include something that
exercises property paths, which happen to matter for a few use cases in which I
am interested). I’m not sure how to engage real-world use very effectively. I
can certainly spin up examples, but it seems like we would want a broader set
of users than just me to try it out, no? {grin}
---
A. Soroka
The University of Virginia Library
Clojure w/o transients
Running org.apache.jena.sparql.core.mem.PerfTest
==== Data: /Users/ajs6f/Documents/jena/bsbm-1m.nt.gz ====
Size: 1,000,312 (3.551s, 281,698 tps)
==== DSG/mix/auto (warm N=3)
==== DSG/mix/txn (warm N=3)
==== DSG/mem/auto (warm N=3)
==== DSG/mem/txn (warm N=3)
==== DSG/mix/auto (N=20)
==== DSG/mix/auto (N=20) Time: 96.106s (208,168 tps)
==== DSG/mix/txn (N=20)
==== DSG/mix/txn (N=20) Time: 95.053s (210,474 tps)
==== DSG/mem/auto (N=20)
==== DSG/mem/auto (N=20) Time: 221.693s (90,242 tps)
==== DSG/mem/txn (N=20)
==== DSG/mem/txn (N=20) Time: 168.189s (118,950 tps)
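
For anyone reading the numbers above: the tps figures appear to be consistent with (dataset size x N) / elapsed seconds, truncated to a whole number. That formula is inferred from the output itself, not taken from the PerfTest source, so treat the sketch below as a sanity check under that assumption. The class and method names are made up for illustration.

```java
// Assumption: tps = (triples * runs) / seconds, truncated to a whole number.
// This is inferred from the reported figures, not read from PerfTest itself.
final class TpsCheck {
    static long tps(long triples, int runs, double seconds) {
        return (long) (triples * runs / seconds);
    }

    public static void main(String[] args) {
        long size = 1_000_312L; // "Size" line from the output above
        System.out.println("load          " + tps(size, 1, 3.551));
        System.out.println("DSG/mix/auto  " + tps(size, 20, 96.106));
        System.out.println("DSG/mix/txn   " + tps(size, 20, 95.053));
        System.out.println("DSG/mem/auto  " + tps(size, 20, 221.693));
        System.out.println("DSG/mem/txn   " + tps(size, 20, 168.189));
    }
}
```

Under that reading, DSG/mem/txn runs at a bit over half the throughput of DSG/mix/txn (118,950 vs. 210,474 tps).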
>
> On 26/09/15 18:31, A. Soroka wrote:
>> On a related note, are there any Jena standard parts for query
>> testing for this kind of situation? I know that BSBM has several
>> sophisticated suites of tests defined, but are any of them considered
>> particularly appropriate, or has anyone out there in dev-land built
>> their own harness for BSBM or something else that I could “borrow”?
>> {grin}
>
> Benchmarks like BSBM look at scale in a different way. BSBM is as much about
> the mem-storage boundary as anything else.
>
> For the general-purpose in-memory dataset, the need is for some lower-level
> tests, mainly to ensure that nothing really bad, and easily addressable, is
> happening.
>
> SPARQL execution is only going to be lightly influenced by dataset speed.
> Complex queries do a lot of intermediate processing (e.g. sorting) and that's
> not to do with the base data. One exception (isn't there always?) is property
> paths. The current implementation can hit the store at fine grain quite
> hard; the ideal is better algorithms for property paths, but it also represents
> what code that directly uses the API might do.
>
> In TDB, it would be better to compute in NodeIds but the current integration
> gets the Nodes IIRC. [Hmm - there is a fairly obvious way to fix that ...
> different discussion.]
>
> A few simple tests that come to mind are:
>
> 1. count all triples - test end to end scan of the dataset
> 2. write the whole dataset to /dev/null.
> 3. same as above but for a graph, default or named.
>
> 4. Some of the more important find() cases, such as find(G,S,?,?),
> find(G,?,P,O) [key lookup], or find(G,?,P,?).
> find(G,?,?,?) is covered by (3).
>
> 5. and the non-G versions for a graph.
> *6. Union graph (if supported)
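
To make the find() shapes in (1) and (4) concrete, here is a rough, dependency-free sketch. It is not Jena code: the quad list, the null-as-wildcard convention, and every name in it are hypothetical stand-ins, and the linear scan says nothing about how a real indexed dataset would behave. It only shows how each access pattern might be timed separately.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a quad store. This is NOT Jena's DatasetGraph;
// it only mimics the shape of find(G,S,P,O), with null playing the "?" role.
final class FindPatterns {
    static final String ANY = null;

    // Each quad is {graph, subject, predicate, object}.
    static List<String[]> sample() {
        List<String[]> quads = new ArrayList<>();
        quads.add(new String[] {"g1", "s1", "p1", "o1"});
        quads.add(new String[] {"g1", "s1", "p2", "o2"});
        quads.add(new String[] {"g1", "s2", "p1", "o1"});
        quads.add(new String[] {"g2", "s1", "p1", "o3"});
        return quads;
    }

    // A quad matches when every non-wildcard slot of the pattern is equal.
    static boolean matches(String[] quad, String... pattern) {
        for (int i = 0; i < 4; i++) {
            if (pattern[i] != ANY && !pattern[i].equals(quad[i])) {
                return false;
            }
        }
        return true;
    }

    // Linear scan over all quads; a real store would consult an index here.
    static List<String[]> find(List<String[]> quads,
                               String g, String s, String p, String o) {
        List<String[]> out = new ArrayList<>();
        for (String[] q : quads) {
            if (matches(q, g, s, p, o)) {
                out.add(q);
            }
        }
        return out;
    }

    // Time one pattern and report how many quads it returned.
    static void time(String label, List<String[]> quads,
                     String g, String s, String p, String o) {
        long start = System.nanoTime();
        int n = find(quads, g, s, p, o).size();
        long micros = (System.nanoTime() - start) / 1_000;
        System.out.println(label + ": " + n + " quads in " + micros + " us");
    }

    public static void main(String[] args) {
        List<String[]> quads = sample();
        time("find(?,?,?,?) full scan ", quads, ANY, ANY, ANY, ANY);
        time("find(G,?,?,?) one graph ", quads, "g1", ANY, ANY, ANY);
        time("find(G,S,?,?)           ", quads, "g1", "s1", ANY, ANY);
        time("find(G,?,P,O) key lookup", quads, "g1", ANY, "p1", "o1");
        time("find(G,?,P,?)           ", quads, "g1", ANY, "p1", ANY);
    }
}
```

With a real dataset the same loop would be pointed at the store's own find() and run over something like the BSBM file, with warm-up passes before the timed ones.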
>
> Given those, I think the next level of verification is real use, rather than
> specific (artificial) situations. Of course, there are also mega-sized
> in-memory use cases (systems can deploy a lot of RAM these days). Then GC
> and/or off-heap memory start getting fun.
>
> Andy