On 26 Aug 2009, at 11:30 AM, Dan Brickley wrote:
> In case anyone missed this query on the semantic-web list...
Thanks Dan.
> ---------- Forwarded message ----------
> From: Niklas Lindström <[email protected]>
> Date: 2009/8/26
> Subject: SPARQL performance for ORDER BY on large datasets
> To: Semantic Web <[email protected]>
>
> Hi all!
Hi Niklas,
Responding here, a month late, because I no longer follow the semantic-
w...@w3 list.
> Is this -- ORDER BY performance -- a commonly known problem, and
> considered an issue of importance (for academia and implementers
> alike)?
Yes, I'd say it's a known problem. The main reason, as I see it, is
that ORDER BY in SPARQL is hard for implementations to do efficiently.
* SPARQL defines its own ordering for values (see the sketch after
this list). That means an implementation that uses some other
ordering internally (for implementation-specific reasons such as
compression, because it supports other query languages, because it
supports efficient typed range queries as AllegroGraph does, or
whatever) is at a natural disadvantage compared to SQL DBs. A SQL DB
can just walk an appropriate index and avoid sorting the results
altogether, because its internal representations align with SQL;
that's usually not true of a triple store that implements SPARQL.
(SPARQL is young.)
* SPARQL ordering is *extensible*. Even an implementation that uses
SPARQL ordering natively for its indices is screwed if you define your
own operator mapping for <. I'd be surprised if any implementation
rebuilt custom indices every time a user defined a new operator mapping.
* Descending and ascending orders can be an annoyance if indices are
built to be walked in one direction (by metaindexing and binary
search, for example).
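To make the first point concrete, here's a minimal sketch (the query
itself is illustrative; the type ordering in the comment is the one
the SPARQL spec defines):

    # ORDER BY in SPARQL imposes a total order across term types:
    # unbound < blank nodes < IRIs < literals, with typed comparisons
    # (numeric, string, boolean, dateTime) among the literals.
    # A store whose native index order differs must re-sort to honour it.
    SELECT ?s ?o
    WHERE { ?s ?p ?o }
    ORDER BY ?o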
If you can't guarantee an in-order walk (which, as I mentioned above,
you usually can't), the semantics of SPARQL essentially require that
the implementation generate all possible results, sorted, prior to
applying the LIMIT.
I see this a lot: a customer wonders why adding LIMIT 10 to their
large, ordered query doesn't cause it to run in a millisecond. Of
course, all the results must still be computed. (In this case, even
computing only the ?modified bindings is the majority of the work, so
partitioning the query into "must run to sort" and "can run after
LIMIT" doesn't help.)
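For concreteness, the shape of the query is roughly this (I don't
know Niklas's exact schema, so dcterms:modified is just a stand-in
for whatever predicate binds ?modified):

    PREFIX dcterms: <http://purl.org/dc/terms/>
    SELECT ?doc ?modified
    WHERE { ?doc dcterms:modified ?modified }
    ORDER BY DESC(?modified)
    LIMIT 10
    # Every ?modified binding must still be computed and sorted before
    # the top ten can be taken; LIMIT trims the output, not the work.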
In short: there are a number of interactions between SPARQL and RDF,
and design decisions in SPARQL itself, which make it tiresome or
impossible for an implementor to make these queries fast. As an
industry we simply haven't had the time to push the specs towards
efficiency, to learn tricks, or fine-tune. Contrast this to the
decades 'enjoyed' by the SQL crowd. SQL and relational databases have
evolved together for many years.
This is arguably a point against the massive stack of XMLey, webby
specs that make up the Semantic Web layer cake: all of the semantic
translations between these things make *every* query, *every* load
that little bit slower. Defining SPARQL operations using XML Schema
Datatypes as a base, for example, puts XSD datatype promotion logic
into every (apparently simple) numeric comparison. Oof.
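A hedged illustration, with a made-up predicate and made-up values:

    PREFIX ex: <http://example.org/ns#>
    SELECT ?item
    WHERE {
      ?item ex:price ?price .
      # ?price might be an xsd:integer in one solution and an xsd:double
      # in the next; each evaluation of this FILTER goes through the
      # XPath operator mapping and XSD numeric type promotion before the
      # actual comparison happens.
      FILTER (?price < 100)
    }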
> Or am I missing something really obvious in the setup of these
> stores, or in my query? I welcome *any* suggestions, such as "use
> triple store X", "for X, make sure to configure indexing on Y". Or
> do RDF-using service builders in general opt to index in something
> else entirely in these cases?
In AllegroGraph, I'd advise something like the following:
* Store your modified times as encoded values, not RDF literals.
Simply defining a datatype mapping for :date-time = xsd:dateTime prior
to loading will suffice. That will immediately make the ordering
comparisons several times faster.
* Ensure your store is completely indexed. Novice AllegroGraph users
often omit this vital step. It doesn't sound like you did, but best to
check.
* Use range queries to bound the results. If you know that there are
more than 100 results between the dawn of time and 1990, let the query
engine know:
FILTER (?modified < "1990-01-01T00:00:00Z"^^xsd:dateTime)
Armed with that information, the store can reduce the amount of work
it has to do in the ORDER BY stage by eliminating results outside the
time window. (If you use encoded datetimes, this FILTER will directly
affect the index walk.)
You might need to trick the planner to get it to do the right thing,
but that's easy enough.
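Putting those pieces together, the query would look roughly like this
(again, dcterms:modified is my stand-in for the real predicate):

    PREFIX dcterms: <http://purl.org/dc/terms/>
    PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
    SELECT ?doc ?modified
    WHERE {
      ?doc dcterms:modified ?modified .
      # With encoded datetimes, this bound becomes part of the index
      # walk rather than a post-hoc filter over every binding.
      FILTER (?modified < "1990-01-01T00:00:00Z"^^xsd:dateTime)
    }
    ORDER BY ?modified
    LIMIT 100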
Feel free to contact me directly if you'd like more advice on this.
You could even build a tree of approximate numbers of entries per
year, load the data into different named graphs based on year, run
multiple queries, whatever. There are workarounds, but that doesn't
address your main point.
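For instance, with per-year graph names (the URIs and predicate below
are invented for illustration):

    PREFIX dcterms: <http://purl.org/dc/terms/>
    SELECT ?doc ?modified
    WHERE {
      # Only the 1989 partition is touched; the other years never
      # enter the sort at all.
      GRAPH <http://example.org/graphs/1989> {
        ?doc dcterms:modified ?modified .
      }
    }
    ORDER BY DESC(?modified)
    LIMIT 100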
> (It seems queries like this are present in the Berlin SPARQL
> Benchmark (e.g. #8), but I haven't analyzed this correlation and its
> possible meanings in depth.)
My personal opinion: the BSBM serves a limited purpose for people
evaluating triple stores, but strikes me as very SQL-ey in style: the
data are the opposite of sparse, and it's not a network. Relational
databases are a much, much better fit for this problem, and thus it's
not very interesting. It's a little like benchmarking how well an
Excel spreadsheet can do pixel animation: sure, you can do it, but
there are other tools which are both mature and more suitable, so why
bother?
(Then again, I'm a "right tool for the job" guy; I use a whole toolbox
of languages, utilities, and services, and I don't expect any one to
effectively substitute for another. It makes me laugh when I see
people trying to store 100MB documents in a database or triple store.)
However, SPARQL itself is a very relational language, lacking any
kind of support for graph-walking, transitivity, and the like, which
makes it almost intentionally unsuited to working with sparse,
graph-like networks of assertions. Make of this what you will.
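For example, to follow a hierarchy you have to spell out every depth
by hand; the skos:broader query below is my own sketch, not anything
from Niklas's data:

    PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
    SELECT DISTINCT ?ancestor
    WHERE {
      # No transitive operator, so each hop count is a separate pattern.
      { <http://example.org/concept/leaf> skos:broader ?ancestor }
      UNION
      { <http://example.org/concept/leaf> skos:broader ?mid .
        ?mid skos:broader ?ancestor }
      UNION
      { <http://example.org/concept/leaf> skos:broader ?mid1 .
        ?mid1 skos:broader ?mid2 .
        ?mid2 skos:broader ?ancestor }
    }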
Regards,
-Richard