[rdflib-dev] Major re-write of MySQL triple pattern evaluation

Chimezie Ogbuji Thu, 01 Nov 2007 11:43:07 -0800

John Clark and I recently finished a major round of optimizations
to the MySQL store.  In a nutshell, SPARQL queries such as this:


SELECT *
{
  :foo :prop1 ?a.
  ?b :prop2 ?a.
  :baz :prop3 ?b
}

used to be evaluated one triple pattern at a time, accumulating
the results at the client (per the original sparql-p implementation).
Now, when a SPARQL query such as the one above is evaluated against the
MySQL store, all three triple patterns are evaluated at once at the
server.  In other words, the MySQL store can now do server-side joins
of Basic Graph Patterns (BGPs).
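
To make the idea concrete, here is a rough sketch of how the three
patterns above could collapse into one SQL statement.  It is only an
illustration: it uses sqlite3 and a single hypothetical "triples" table
so that it runs anywhere, whereas the real store targets MySQL and its
own table layout.  The data values are made up.

```python
import sqlite3

# Hypothetical single-table layout, used only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE triples (s TEXT, p TEXT, o TEXT)")
conn.executemany(
    "INSERT INTO triples VALUES (?, ?, ?)",
    [
        (":foo", ":prop1", ":x"),
        (":y", ":prop2", ":x"),
        (":baz", ":prop3", ":y"),
        (":z", ":prop2", ":x"),  # satisfies the second pattern only
    ],
)

# One table alias per triple pattern; each shared variable becomes a
# join condition, so the database does the unification.
sql = """
SELECT t1.o AS a, t2.s AS b
FROM triples t1
JOIN triples t2 ON t2.o = t1.o      -- ?a is shared by patterns 1 and 2
JOIN triples t3 ON t3.o = t2.s      -- ?b is shared by patterns 2 and 3
WHERE t1.s = ':foo' AND t1.p = ':prop1'
  AND t2.p = ':prop2'
  AND t3.s = ':baz' AND t3.p = ':prop3'
"""
solutions = [{"a": a, "b": b} for a, b in conn.execute(sql)]
```

The row for :z never reaches the client, because the third pattern
rules it out inside the database.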

This has a significant effect on query performance for the following reasons:

- The unification of the variables shared between the three triple
patterns is now performed by the MySQL database (which is highly
optimized for this sort of thing) rather than at the client (very
inefficiently).
- The number of intermediate results a query service has to contend
with is significantly reduced.
- The granularity of results that can be streamed back from the
database via a lazy cursor is significantly larger.
- This maximizes the efficiency of SPARQL queries which consist of
conjunctions only (which can be solved in O(|P|.|D|) time, i.e.,
linear in the size of the pattern and of the data, per the analysis
by Perez et al. [1]).  In layman's terms: SPARQL queries of this
simple form will be *very* scalable.

The changes introduce a new method to the MySQL store:
batch_unify(self, patterns).  The docstring reads:

"""
Perform RDF triple store-level unification of a list of triple
patterns (4-item tuples which correspond to a SPARQL triple pattern
with an additional constraint for the graph name).  For the MySQL
backend, this method compiles the list of triple patterns into SQL
statements that obtain bindings for all the variables in the list of
triple patterns.

        :Parameters:
        - `patterns`: a list of 4-item tuples where any of the items can be
                one of: Variable, URIRef, BNode, or Literal.

        Returns a generator over dictionaries of solutions to the list of
        triple patterns.  Each dictionary binds the variables in the triple
        patterns to the correct values for those variables.

        For more on unification see:
        http://en.wikipedia.org/wiki/Unification
"""
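
As a rough illustration of that contract (a list of 4-item patterns
in, a generator of binding dictionaries out), here is a toy in-memory
version.  It is not the MySQL implementation, which compiles the
patterns to SQL; the Var class is a hypothetical stand-in for rdflib's
Variable, plain strings stand in for the other terms, and the
stand-alone function takes the data as an extra argument.

```python
class Var(str):
    """Hypothetical stand-in for rdflib's Variable."""


def batch_unify(quads, patterns):
    """Return a generator of {variable: term} dicts, one per solution."""

    def solve(remaining, binding):
        if not remaining:
            yield dict(binding)
            return
        pattern, rest = remaining[0], remaining[1:]
        for quad in quads:
            new = dict(binding)
            for want, have in zip(pattern, quad):
                if isinstance(want, Var):
                    # Bind the variable, or check it against an
                    # earlier binding from a previous pattern.
                    if new.setdefault(want, have) != have:
                        break
                elif want != have:
                    break
            else:
                yield from solve(rest, new)

    return solve(list(patterns), {})


quads = [
    (":foo", ":prop1", ":x", ":g"),
    (":y", ":prop2", ":x", ":g"),
    (":baz", ":prop3", ":y", ":g"),
]
patterns = [
    (":foo", ":prop1", Var("a"), ":g"),
    (Var("b"), ":prop2", Var("a"), ":g"),
    (":baz", ":prop3", Var("b"), ":g"),
]
solutions = batch_unify(quads, patterns)  # nothing is computed yet
first = next(solutions)                   # work happens on demand
```

The fourth item of each tuple is the graph name; it could just as well
be a Var to match any graph.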

So, this functionality can be invoked programmatically instead of via
SPARQL.  In addition, the solutions (dictionaries mapping variables to
terms) are returned through a lazy generator over the result set
(using a lazy MySQLdb cursor), so even for queries with ridiculously
large result sets, the answers can be iterated over at the user's whim.

The Store superclass / stub was updated with a batch_unification
attribute (False by default) which other stores can set to indicate
whether they can perform server-side unification.  The SPARQL engine
uses it to decide whether to offload entire BGP unifications to the
server.
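
A minimal sketch of how an engine might act on that flag follows.
Only the batch_unification attribute and the batch_unify method name
come from the description above; the classes, the evaluate_bgp
function and the fallback are hypothetical, and the returned
dictionaries are placeholders rather than real bindings.

```python
class Store:
    """Hypothetical stub; only the flag itself is from the post."""
    batch_unification = False


class ServerJoinStore(Store):
    """Hypothetical store that advertises server-side unification."""
    batch_unification = True

    def batch_unify(self, patterns):
        # A real store would compile the patterns into SQL here.
        return iter([{"strategy": "server", "patterns": len(patterns)}])


def one_pattern_at_a_time(store, patterns):
    # Placeholder for the original client-side evaluation.
    return iter([{"strategy": "client", "patterns": len(patterns)}])


def evaluate_bgp(store, patterns, fallback=one_pattern_at_a_time):
    """Offload the whole BGP if the store supports it, else fall back."""
    if getattr(store, "batch_unification", False):
        return store.batch_unify(patterns)
    return fallback(store, patterns)


server_side = next(evaluate_bgp(ServerJoinStore(), ["p1", "p2", "p3"]))
client_side = next(evaluate_bgp(Store(), ["p1", "p2", "p3"]))
```

Stores that leave the flag at its default keep working unchanged,
since the engine simply takes the old path for them.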

Finally, as a further optimization, the MySQL store has two new
attributes:

        self.literal_properties = set()
        '''set of URIRefs of those RDF properties which are known to range
        over literals.'''
        self.resource_properties = set()
        '''set of URIRefs of those RDF properties which are known to range
        over resources.'''

This allows the MySQL store to search only the tables that are
relevant, based on the domain / range of the properties used in the
patterns passed to batch_unify.  So, a particular SPARQL query service
can be fully optimized for the vocabulary it uses in its dataset.
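
To show the kind of pruning this enables, here is a small sketch.  The
two table names and the candidate_tables helper are hypothetical (the
store's actual table layout is not described here), and plain strings
stand in for URIRefs.

```python
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
FOAF_KNOWS = "http://xmlns.com/foaf/0.1/knows"
DC_SUBJECT = "http://purl.org/dc/elements/1.1/subject"

# Configured once per query service, based on its vocabulary.
literal_properties = {RDFS_LABEL}
resource_properties = {FOAF_KNOWS}


def candidate_tables(predicate):
    """Pick the (hypothetical) tables a pattern needs to search."""
    if predicate in literal_properties:
        return ["literal_statements"]
    if predicate in resource_properties:
        return ["resource_statements"]
    # Nothing is known about the range, so search both.
    return ["literal_statements", "resource_statements"]


label_tables = candidate_tables(RDFS_LABEL)
knows_tables = candidate_tables(FOAF_KNOWS)
unknown_tables = candidate_tables(DC_SUBJECT)
```

A predicate that appears in neither set still works; it just cannot
benefit from the pruning.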

-- Chimezie

[1] http://iswc2006.semanticweb.org/items/Arenas2006bv.pdf
_______________________________________________
Dev mailing list
Dev@rdflib.net
http://rdflib.net/mailman/listinfo/dev
