On 26 Aug 2009, at 11:30 AM, Dan Brickley wrote:
> In case anyone missed this query on the semantic-web list...
Thanks Dan.
> ---------- Forwarded message ----------
> From: Niklas Lindström <[email protected]>
> Date: 2009/8/26
> Subject: SPARQL performance for ORDER BY on large datasets
> To: Semantic Web <[email protected]>
>
> Hi all!
Hi Niklas,
Responding here, a month late, because I no longer follow the semantic-
w...@w3 list.
> Is this -- ORDER BY performance -- a commonly known problem, and
> considered an issue of importance (for academia and implementers
> alike)?
Yes, I'd say it's a known problem. The main reason, as I see it, is
that ORDER BY in SPARQL is hard for implementations to do efficiently.
* SPARQL defines its own ordering for values (see the sketch after
this list). That means an implementation that uses some other
ordering internally (for implementation-specific reasons such as
compression, because it supports other query languages, because it
supports efficient typed range queries as AllegroGraph does, or
whatever) is at a natural disadvantage compared to SQL DBs. A SQL DB
can just walk an appropriate index and avoid sorting the results
altogether, because its internal representations align with SQL;
that's usually not true of a triple store that implements SPARQL.
(SPARQL is young.)
* SPARQL ordering is *extensible*. Even an implementation that uses
SPARQL ordering natively for its indices is screwed if you define your
own operator mapping for <. I'd be surprised if any implementation
rebuilt custom indices every time a user defined a new operator mapping.
* Descending and ascending orders can be an annoyance if indices are
built to be walked in one direction (by metaindexing and binary
search, for example).
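To make the first point concrete, here's a minimal sketch (the query
itself is illustrative; the type ordering in the comment is the one
the SPARQL spec defines):

    # ORDER BY in SPARQL imposes a total order across term types:
    # unbound < blank nodes < IRIs < literals, with typed comparisons
    # (numeric, string, boolean, dateTime) among the literals.
    # A store whose native index order differs must re-sort to honour it.
    SELECT ?s ?o
    WHERE { ?s ?p ?o }
    ORDER BY ?o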
If you can't guarantee an in-order walk (which, as I mentioned above,
you usually can't), the semantics of SPARQL essentially require that
the implementation generate all possible results, sorted, prior to
applying the LIMIT.
I see this a lot: a customer wonders why adding LIMIT 10 to their
large, ordered query doesn't cause it to run in a millisecond. Of
course, all the results must still be computed. (In this case, even
computing only the ?modified bindings is the majority of the work, so
partitioning the query into "must run to sort" and "can run after
LIMIT" doesn't help.)
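For concreteness, the shape of the query is roughly this (I don't
know Niklas's exact schema, so dcterms:modified is just a stand-in
for whatever predicate binds ?modified):

    PREFIX dcterms: <http://purl.org/dc/terms/>
    SELECT ?doc ?modified
    WHERE { ?doc dcterms:modified ?modified }
    ORDER BY DESC(?modified)
    LIMIT 10
    # Every ?modified binding must still be computed and sorted before
    # the top ten can be taken; LIMIT trims the output, not the work.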
In short: there are a number of interactions between SPARQL and RDF,
and design decisions in SPARQL itself, which make it tiresome or
impossible for an implementor to make these queries fast. As an
industry we simply haven't had the time to push the specs towards
efficiency, to learn tricks, or fine-tune. Contrast this to the
decades 'enjoyed' by the SQL crowd. SQL and relational databases have
evolved together for many years.
This is arguably a point against the massive stack of XMLey, webby
specs that make up the Semantic Web layer cake: all of the semantic
translations between these things make *every* query, *every* load
that little bit slower. Defining SPARQL operations using XML Schema
Datatypes as a base, for example, puts XSD datatype promotion logic
into every (apparently simple) numeric comparison. Oof.
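A hedged illustration, with a made-up predicate and made-up values:

    PREFIX ex: <http://example.org/ns#>
    SELECT ?item
    WHERE {
      ?item ex:price ?price .
      # ?price might be an xsd:integer in one solution and an xsd:double
      # in the next; each evaluation of this FILTER goes through the
      # XPath operator mapping and XSD numeric type promotion before the
      # actual comparison happens.
      FILTER (?price < 100)
    }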
> Or am I missing something really obvious in the setup of these
> stores, or in my query? I welcome *any* suggestions, such as "use
> triple store X", "for X, make sure to configure indexing on Y". Or
> do RDF-using service builders in general opt to index in something
> else entirely in these cases?
In AllegroGraph, I'd advise something like the following:
* Store your modified times as encoded values, not RDF literals.
Simply defining a datatype mapping for :date-time = xsd:dateTime prior
to loading will suffice. That will immediately make the ordering
comparisons several times faster.
* Ensure your store is completely indexed. Novice AllegroGraph users
often omit this vital step. It doesn't sound like you did, but best to
check.
* Use range queries to bound the results. If you know that there are
more than 100 results between the dawn of time and 1990, let the query
engine know:
FILTER (?modified < "1990-01-01T00:00:00Z"^^xsd:dateTime)
Armed with that information, the store can reduce the amount of work
it has to do in the ORDER BY stage by eliminating results outside the
time window. (If you use encoded datetimes, this FILTER will directly
affect the index walk.)
You might need to trick the planner to get it to do the right thing,
but that's easy enough.
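Putting those pieces together, the query would look roughly like this
(again, dcterms:modified is my stand-in for the real predicate):

    PREFIX dcterms: <http://purl.org/dc/terms/>
    PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
    SELECT ?doc ?modified
    WHERE {
      ?doc dcterms:modified ?modified .
      # With encoded datetimes, this bound becomes part of the index
      # walk rather than a post-hoc filter over every binding.
      FILTER (?modified < "1990-01-01T00:00:00Z"^^xsd:dateTime)
    }
    ORDER BY ?modified
    LIMIT 100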
Feel free to contact me directly if you'd like more advice on this.
You could even build a tree of approximate numbers of entries per
year, load the data into different named graphs based on year, run
multiple queries, whatever. There are workarounds, but that doesn't
address your main point.
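For instance, with per-year graph names (the URIs and predicate below
are invented for illustration):

    PREFIX dcterms: <http://purl.org/dc/terms/>
    SELECT ?doc ?modified
    WHERE {
      # Only the 1989 partition is touched; the other years never
      # enter the sort at all.
      GRAPH <http://example.org/graphs/1989> {
        ?doc dcterms:modified ?modified .
      }
    }
    ORDER BY DESC(?modified)
    LIMIT 100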
> (It seems queries like this are present in the Berlin SPARQL
> Benchmark (e.g. #8), but I haven't analyzed this correlation and its
> possible meanings in depth.)
My personal opinion: the BSBM serves a limited purpose for people
evaluating triple stores, but strikes me as very SQL-ey in style: the
data are the opposite of sparse, and it's not a network. Relational
databases are a much, much better fit for this problem, and thus it's
not very interesting. It's a little like benchmarking how well an
Excel spreadsheet can do pixel animation: sure, you can do it, but
there are other tools which are both mature and more suitable, so why
bother?
(Then again, I'm a "right tool for the job" guy; I use a whole toolbox
of languages, utilities, and services, and I don't expect any one to
effectively substitute for another. It makes me laugh when I see
people trying to store 100MB documents in a database or triple store.)
However, SPARQL itself is a very relational language, lacking any
kind of support for graph-walking, transitivity, and the like, which
makes it almost intentionally unsuited to working with sparse,
graph-like networks of assertions. Make of this what you will.
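For example, to follow a hierarchy you have to spell out every depth
by hand; the skos:broader query below is my own sketch, not anything
from Niklas's data:

    PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
    SELECT DISTINCT ?ancestor
    WHERE {
      # No transitive operator, so each hop count is a separate pattern.
      { <http://example.org/concept/leaf> skos:broader ?ancestor }
      UNION
      { <http://example.org/concept/leaf> skos:broader ?mid .
        ?mid skos:broader ?ancestor }
      UNION
      { <http://example.org/concept/leaf> skos:broader ?mid1 .
        ?mid1 skos:broader ?mid2 .
        ?mid2 skos:broader ?ancestor }
    }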
Regards,
-Richard