Re: BSBM With Triples and Mapped Relational Data in Virtuoso

Chris Bizer Thu, 07 Aug 2008 02:35:52 -0700


Hi Orri and Ivan,

Consequently, we need to show that mapping can outperform an RDF
warehouse, which is what we'll do here.

Yes. I had been guessing for a while that SPARQL against RDF-mapped relational DBs should be faster than SPARQL against triple stores. With D2R Server it turned out that some queries are much faster, but also that D2R Server performs really badly on others (especially Q5). The bad performance on some queries was no surprise, as there is still lots of room for improvement in D2R Server's SPARQL-to-SQL query rewriting algorithm. Another observation was that the distance between native RDF stores and RDF-mapped RDBs increases with dataset size. So it looks like if you have more than 50M triples and schemata that somehow fit into an RDB, you should go for the RDB solution.

We also see that the advantage of mapping can be further increased
by more compiler optimizations, so we expect in the end mapping will
lead RDF warehousing by a factor of 4 or so.

Being able to show a factor of 4 on all dataset sizes would be very interesting!

Suggestions for BSBM

* Reporting Rules. The benchmark spec should specify a form for
 disclosure of test run data, TPC style. This includes things like
 configuration parameters and exact text of queries. There should
 be accepted variants of query text, as with the TPC.

We have started collecting stuff that should go into the full-disclosure report in section 6.2 of the benchmark spec
http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/spec/index.html#reporting
but have not had the time to define a proper format for this yet (I guess we will have some XML format). We will define the format for version 2 of the benchmark, which will be released together with updated results in about 3-4 weeks.

If you think that there is something missing from this list, please let us know.

* Multiuser operation. The test driver should get a stream number as
 parameter, so that each client makes a different query sequence.
 Also, disk performance in this type of benchmark can only be
 reasonably assessed with a naturally parallel multiuser workload.

Yes. This is already on our todo list and will also be part of the next release.
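
As a rough illustration of the stream-number idea (not the actual BSBM test driver; the query labels, function name, and seeding scheme are all assumptions made for this sketch), each client could derive its own reproducible query sequence from its stream number:

```python
import random

# Hypothetical query mix: the labels stand in for BSBM query templates.
QUERY_MIX = ["Q1", "Q2", "Q3", "Q4", "Q5", "Q6"]


def query_sequence(stream_number, length, base_seed=0):
    """Return a reproducible query sequence for one client stream."""
    # Seeding from (base_seed, stream_number) gives each stream its own
    # sequence while keeping every run repeatable.
    rng = random.Random(base_seed * 1_000_003 + stream_number)
    return [rng.choice(QUERY_MIX) for _ in range(length)]


if __name__ == "__main__":
    for stream in range(3):
        print(stream, query_sequence(stream, 8))
```

Running the same stream number twice yields the same sequence, so a multiuser run stays repeatable even though the clients differ from one another.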

* Add business intelligence. SPARQL has aggregates now, at least
 with Jena and Virtuoso, so let's use these. The BSBM business
 intelligence metric should be a separate metric off the same data.
 Adding synthetic sales figures would make more interesting queries
 possible. For example, producing recommendations like "customers
 who bought this also bought xxx."

Hmm, yes and no. I would love to extend the benchmark with a BI query mix, but aggregates are not yet an official part of SPARQL. Our goal with the benchmark was to define a tool to compare stores that implement the current SPARQL specs, not to fix these specs. Thus, we stayed within the boundaries of the current spec and of course ran into all the known problems of SPARQL (no aggregates, no free-text search, no proper negation). All these things were discussed at the SPARQL 2 BOF at WWW2008 and I hope that they are all on Ivan Herman's list for the charter of a new SPARQL WG.

* For the SPARQL community, BSBM sends the message that one ought to
 support parameterized queries and stored procedures. This would be
 a SPARQL protocol extension; the SPARUL syntax should also have a
 way of calling a procedure. Something like select proc (??, ??)
 would be enough, where ?? is a parameter marker, like ? in
 ODBC/JDBC.

Also a great idea and maybe something Ivan does not have on his list yet.
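
To make the parameter-marker idea concrete, here is a minimal client-side sketch. It is only an assumption about how `??` markers could be filled in: the proposal above is for a protocol extension handled by the server, and the helper names `sparql_literal` and `bind` are hypothetical. The naive split would also misfire on a `??` that appears inside a string literal in the template.

```python
def sparql_literal(value):
    """Render a Python value as a SPARQL literal (small subset only)."""
    if isinstance(value, bool):
        return "true" if value else "false"
    if isinstance(value, int):
        return str(value)
    if isinstance(value, str):
        escaped = (
            value.replace("\\", "\\\\")
            .replace('"', '\\"')
            .replace("\n", "\\n")
        )
        return '"' + escaped + '"'
    raise TypeError("unsupported parameter type: %r" % type(value))


def bind(template, params):
    """Replace each ?? marker in the template with the next parameter."""
    parts = template.split("??")
    if len(parts) - 1 != len(params):
        raise ValueError("parameter count does not match marker count")
    out = [parts[0]]
    for value, rest in zip(params, parts[1:]):
        out.append(sparql_literal(value))
        out.append(rest)
    return "".join(out)


if __name__ == "__main__":
    print(bind("select proc (??, ??)", [42, "widget"]))
```

A real implementation would pass the parameters separately from the query text, as ODBC/JDBC do, so that the server can reuse the compiled query plan.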

* Add transactions. Especially if we are contrasting mapping vs.
 storing triples, having an update flow is relevant. In practice,
 this could be done by having the test driver send web service
 requests for order entry and the SUT could implement these as
 updates to the triples or a mapped relational store. This could
 use stored procedures or logic in an app server.

In principle yes, but we also wanted to design a benchmark that some current RDF stores are able to run. If I look at the current data load times of the SUTs, I'm not so sure that they would like update streams ;-)

But I agree that update streams are clearly something that we should have in the future.

Comments on Query Mix

The time of most queries is less than linear to the scale factor. Q6
is an exception if it is not implemented using a text index. Without
the text index, Q6 will inevitably come to dominate query time as the
scale is increased, and thus will make the benchmark less relevant at
larger scales.

You are right, and it is again a problem of us trying to stay within the boundaries of the SPARQL spec. No sane person would use a regex for this kind of free-text search, but SPARQL only offers the regex function and nothing else.

Maybe we should be a bit less strict here and allow proprietary variants of Q6 until SPARQL gets fixed.
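
A toy sketch of why the regex variant of Q6 scales so badly compared with a text index. This is plain Python over a list of labels, not how any particular store implements it, and all names are made up for illustration. The regex version has to touch every label, while the index lookup only touches the entries for the requested word. Note the two are not strictly equivalent: the regex matches substrings, the index matches whole words.

```python
import re
from collections import defaultdict


def regex_scan(labels, word):
    """Linear scan: test every label, as a regex filter must."""
    pattern = re.compile(re.escape(word), re.IGNORECASE)
    return {i for i, label in enumerate(labels) if pattern.search(label)}


def build_text_index(labels):
    """Build an inverted index from lowercased word to label positions."""
    index = defaultdict(set)
    for i, label in enumerate(labels):
        for token in re.findall(r"\w+", label.lower()):
            index[token].add(i)
    return index


def index_lookup(index, word):
    """Single dictionary lookup, independent of the number of labels."""
    return set(index.get(word.lower(), set()))


if __name__ == "__main__":
    labels = ["red widget", "blue gadget", "Widget deluxe"]
    index = build_text_index(labels)
    print(regex_scan(labels, "widget"), index_lookup(index, "widget"))
```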

Next

We include the sources of our RDF view definitions and other material
for running BSBM with our forthcoming Virtuoso Open Source 5.0.8
release. This also includes all the query optimization work done for
BSBM. This will be available in the coming days.

Great. We are looking forward to rerunning the benchmark with the new Virtuoso release on our box. Especially being able to confirm the factor 4 advantage of RDF-mapped RDBs over RDF stores would be fun ;-)


Cheers

Chris and Andreas



- Orri





