Hi Mariano,

On 04/12/11 13:54, Mariano Rodriguez wrote:
> Hi all,
>
> We are now benchmarking several triple stores that support inference
> through forward chaining against a system that does a particular form
> of query rewriting.
>
> The benchmark we are using is simple: an extended version of LUBM,
> using big datasets LUBM 1000, 8000, 15000, 250000. From Jena we would
> like to benchmark loading time, inference time and query answering
> time, using both TDB and SDB. Inference should be done with limited
> amounts of memory, the less the better. However, we are having
> difficulty understanding what the fair way to do this is. Also, the
> system used for these benchmarks should be a simple system, not a
> cluster or a server with large resources. We would like to ask the
> community for help to approach this in the best way possible. Hence
> this email :). Here are some questions and ideas.
>
> Is it the case that the default inference engine of Jena requires all
> triples to be in memory? Is it not possible to do this on disk? If
> this is so, what would be the fair way to benchmark the system?

There are a couple of dimensions to think about:

1/ Do you want to test LUBM or more general data?
2/ What level of inference do you wish to test?

(1) => For LUBM, there is no inference across universities, so you can generate the data for one university, run the forward-chaining inference on it, and move on to the next university, knowing that no triples generated later will affect the university you have just processed (and so you don't need to retain state for it).

(2) => Inference for LUBM needs only one data triple, plus access to the ontology, to calculate the inferences. Once a triple has been processed, you can emit the inferred triples and move on. Again, no data-related state is needed.
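
An untested sketch of that per-university scheme, using the same
getOWLReasoner() call as in your workflow below (file names are made up):

    // Untested sketch: materialise the closure one university at a time,
    // so only one university's data is ever in memory.
    import com.hp.hpl.jena.rdf.model.*;
    import com.hp.hpl.jena.reasoner.Reasoner;
    import com.hp.hpl.jena.reasoner.ReasonerRegistry;

    public class PerUniversityInfer {
        public static void main(String[] args) {
            Model onto = ModelFactory.createDefaultModel();
            onto.read("file:univ-bench.owl");           // the LUBM ontology
            // Bind the schema once; reuse the specialised reasoner per file.
            Reasoner reasoner = ReasonerRegistry.getOWLReasoner().bindSchema(onto);
            for (String file : args) {                  // one file per university
                Model data = ModelFactory.createDefaultModel();
                data.read("file:" + file);
                InfModel inf = ModelFactory.createInfModel(reasoner, data);
                inf.write(System.out, "N-TRIPLES");     // emit data + inferences
                inf.close();                            // discard per-university state
            }
        }
    }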

The Jena rule-based reasoner, which is RETE-based, is more powerful than is needed for RDFS or LUBM: it supports rules over multiple data triples, and retraction. The cost is that it stores internal state in memory, scaling with the size of the data.
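
For example, an untested sketch with a made-up rule over a hypothetical
vocabulary, just to show a rule whose body joins two data triples (the
case that forces the engine to keep per-triple state):

    // Untested sketch: a forward rule joining two data triples.
    import com.hp.hpl.jena.rdf.model.*;
    import com.hp.hpl.jena.reasoner.rulesys.GenericRuleReasoner;
    import com.hp.hpl.jena.reasoner.rulesys.Rule;

    public class RuleDemo {
        public static void main(String[] args) {
            String rule =
                "[colleagues: (?x <http://example.org/worksFor> ?d) " +
                "             (?y <http://example.org/worksFor> ?d) " +
                "             -> (?x <http://example.org/colleagueOf> ?y)]";
            GenericRuleReasoner reasoner =
                new GenericRuleReasoner(Rule.parseRules(rule));
            reasoner.setMode(GenericRuleReasoner.FORWARD_RETE);
            Model data = ModelFactory.createDefaultModel();
            data.read("file:data.rdf");                 // hypothetical input
            InfModel inf = ModelFactory.createInfModel(reasoner, data);
            inf.write(System.out, "N-TRIPLES");
        }
    }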

There is also a stream-based forward-chaining engine, riotcmd.infer, that keeps the RDFS schema in memory but not the state of the data, so it uses a fixed amount of space that does not grow with the data size.

This is probably the best way to infer over LUBM at scale.

> Right now we are thinking of a workflow as follows:
>
> 1. Start a TDB or SDB store.
> 2. Load 10 LUBM universities in memory, compute the closure using
>
>        Reasoner reasoner = ReasonerRegistry.getOWLReasoner();
>        InfModel inf = ModelFactory.createInfModel(reasoner, monto, m);
>
>    and store the result in SDB or TDB. When finished,
> 3. Query the store directly.
>
> Is this the most efficient way to do it? Are there important
> parameters (besides the number of universities used in the
> computation of the closure) that we should tune to guarantee a fair
> evaluation? Are there any documents that we could use to guide
> ourselves during the tuning of Jena?

This exploits the features of LUBM (you only need one university at a time). I don't have figures, but I'd expect riotcmd.infer to be faster, as it is less general.

The flow is:

infer --rdfs=VOCAB DATA | tdbloader2 --loc DB

on a 64-bit system.  Linux is faster than Windows.

(tdbloader2 only runs on Linux currently - Paolo has a pure Java version on GitHub)
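
After loading, step 3 is then just ordinary SPARQL against the store.
An untested sketch (the class name and query are made up):

    // Untested sketch: query the loaded TDB store directly; no inference
    // at query time, since the closure was materialised at load time.
    import com.hp.hpl.jena.query.*;
    import com.hp.hpl.jena.rdf.model.Model;
    import com.hp.hpl.jena.tdb.TDBFactory;

    public class QueryStore {
        public static void main(String[] args) {
            Model model = TDBFactory.createModel("DB");   // same --loc as tdbloader2
            String q = "SELECT ?s WHERE { ?s a " +
                "<http://swat.cse.lehigh.edu/onto/univ-bench.owl#FullProfessor> } " +
                "LIMIT 10";
            QueryExecution qe =
                QueryExecutionFactory.create(QueryFactory.create(q), model);
            ResultSetFormatter.out(qe.execSelect());
            qe.close();
            model.close();
        }
    }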

> Thank you very much in advance, everybody.
>
> Best regards,
> Mariano

        Good luck,
        Andy




> Mariano Rodriguez Muro
> http://www.inf.unibz.it/~rodriguez/
> KRDB Research Center, Faculty of Computer Science
> Free University of Bozen-Bolzano (FUB)
> Piazza Domenicani 3, I-39100 Bozen-Bolzano BZ, Italy




