John Clark and myself recently finished a major round of optimizations to the MySQL store. In a nutshell, SPARQL queries such as this:
SELECT * { :foo :prop1 ?a. ?b :prop2 ?a. :baz :prop3 ?b } Used to be evaluated a single triple pattern at a time, accumulating the results at the client (per the original sparql-p implementation). Now, when evaluating a SPARQL query such as above against the MySQL store, the 3 triple patterns will be evaluated at once at the server. So essentially, the MySQL store can now do server-side joins of Basic Graph Patterns. This has a significant effect on query performance for the following reasons: - The unification of the variables between the three triple patterns are now performed using the MySQL database (which is highly optimized for this sort of thing) rather than doing it at the client (very inefficiently) - The amount of intermediate results a query service has to contend with is significantly reduced - The granularity of results that can be streamed back from the database via a lazy cursor is significantly larger. - This maximizes the efficiency of SPARQL queries which consist of conjunctions only (which can be solved in O(|P|.|D|) time - i.e., linear complexity - per analysis by Perez, P, et. al [1]). In layman's terms: SPARQL queries with this simple form will be *very* scalable The changes introduce a new method to the MySQL store: batch_unify(self, patterns). The docstring reads: """ Perform RDF triple store-level unification of a list of triple patterns (4-item tuples which correspond to a SPARQL triple pattern with an additional constraint for the graph name). For the MySQL backend, this method compiles the list of triple patterns into SQL statements that obtain bindings for all the variables in the list of triples patterns. :Parameters: - `patterns`: a list of 4-item tuples where any of the items can be one of: Variable, URIRef, BNode, or Literal. Returns a generator over dictionaries of solutions to the list of triple patterns. Each dictionary binds the variables in the triple patterns to the correct values for those variables. For more on unification see: http://en.wikipedia.org/wiki/Unification """ So, this functionality can be invoked programmatically instead of via SPARQL. In addition, the solutions returned (a dictionary of variables to terms) is a lazy generator over the result set (using a lazy MySQLdb cursor), so even for queries with ridiculously large result sets, the answers can be iterated over at the user's whim. The Store superclass / stub was updated with a batch_unification attribute (False by default) which can be used by other stores to indicate if they can perform server-side unification. This is used by the SPARQL engine to determine if it should attempt to offload entire BGP unifications to the server. Finally, as an additional optimization, the MySQL store has an additional set of attributes: self.literal_properties = set() '''set of URIRefs of those RDF properties which are known to range over literals.''' self.resource_properties = set() '''set of URIRefs of those RDF properties which are known to range over resources.''' This allows the MySQL store to only search the tables that are relevant based on the domain / range of properties used in the patterns passed on to batch_unification. So, a particular SPARQL query service can be fully optimized for the vocabulary it uses in it's dataset. -- Chimezie [1] http://iswc2006.semanticweb.org/items/Arenas2006bv.pdf _______________________________________________ Dev mailing list Dev@rdflib.net http://rdflib.net/mailman/listinfo/dev