Firstly, the best execution strategy depends on the join algorithms available, the "shape" of the basic graph patterns, the data frequencies, and the indexes available - not just the frequencies alone.

For example, several other proposals in the RDF space use 6-way indexing so that every triple order is available, and then merge joins (without a sort step) are possible.

For TDB (1 and 2) we have a stats system:

https://jena.apache.org/documentation/tdb/optimizer.html

and the weighted reorder interface ReorderTransformationSubstitution;
ReorderWeighted uses the stats.

The default is ReorderFixed - in practice this does quite well because people tend to write the selective parts first anyway, especially inverse-functional properties first (and they are 2-way grounded). It avoids rdf:type because that is sometimes very unselective.
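To illustrate the idea (this is a simplified sketch of the heuristic, not Jena's actual ReorderFixed code; the class and names here are hypothetical): score each triple pattern by how many of its terms are grounded, penalise rdf:type, and evaluate the lowest-scoring patterns first.

```java
import java.util.*;

// Hypothetical sketch of a fixed-weight reorder heuristic, not Jena's
// actual ReorderFixed: patterns with more grounded terms come first,
// and rdf:type patterns are pushed later because they are often unselective.
public class FixedReorderSketch {

    // A triple pattern as three strings; terms starting with "?" are variables.
    static class Pattern {
        final String s, p, o;
        Pattern(String s, String p, String o) { this.s = s; this.p = p; this.o = o; }
    }

    // Lower score = evaluate earlier.
    static int score(Pattern t) {
        int grounded = 0;
        if (!t.s.startsWith("?")) grounded++;
        if (!t.p.startsWith("?")) grounded++;
        if (!t.o.startsWith("?")) grounded++;
        int score = 3 - grounded;                  // fewer variables first
        if ("rdf:type".equals(t.p)) score += 2;    // deprioritise rdf:type
        return score;
    }

    static List<Pattern> reorder(List<Pattern> bgp) {
        List<Pattern> out = new ArrayList<>(bgp);
        out.sort(Comparator.comparingInt(FixedReorderSketch::score));
        return out;
    }

    public static void main(String[] args) {
        List<Pattern> bgp = List.of(
            new Pattern("?x", "rdf:type", "foaf:Person"),
            new Pattern("?x", "foaf:mbox", "<mailto:a@example.org>"),
            new Pattern("?x", "foaf:name", "?n"));
        // The inverse-functional mbox pattern sorts first; rdf:type sorts last.
        for (Pattern t : reorder(bgp))
            System.out.println(t.p);
    }
}
```

Note this is purely syntactic - no statistics are consulted, which is exactly why it works well on queries people tend to write and less well on generated ones.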

The leapfrog-join paper ("A Worst-Case Optimal Join Algorithm for SPARQL" by Aidan Hogan, Cristian Riveros, Carlos Rojas and Adrian Soto) at ISWC 2019 has some interesting ideas - noting that it is "best worst case", not "ideal". The identification of "lonely variables" would help patterns with star-shaped sections, which currently incur repeated work. That can be achieved by caching accesses.
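A minimal sketch of what "caching accesses" could look like (the names here are hypothetical, not a Jena API): memoise the result of an index probe keyed by the grounded part of the pattern, so a star-shaped BGP that repeats the same probe does the index work once.

```java
import java.util.*;
import java.util.function.Function;

// Illustrative probe cache, not Jena code: wraps an index lookup
// (here abstracted as a Function from a key to matching subjects)
// and remembers results so repeated probes are served from memory.
public class ProbeCache {
    private final Map<String, List<String>> cache = new HashMap<>();
    private final Function<String, List<String>> index;
    int probes = 0;   // number of real index accesses, for illustration

    ProbeCache(Function<String, List<String>> index) { this.index = index; }

    // Subjects matching (?s predicate object); cached by "predicate object".
    List<String> subjectsFor(String predicate, String object) {
        String key = predicate + " " + object;
        return cache.computeIfAbsent(key, k -> { probes++; return index.apply(k); });
    }

    public static void main(String[] args) {
        ProbeCache pc = new ProbeCache(k -> List.of("ex:s1", "ex:s2"));
        pc.subjectsFor("foaf:knows", "ex:alice");
        pc.subjectsFor("foaf:knows", "ex:alice");  // served from the cache
        System.out.println("real probes: " + pc.probes);
    }
}
```

The trade-off is memory for the cached bindings versus the repeated scan cost, so in practice such a cache would need a size bound or per-query scope.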


Viewing it as an entity problem (variable-length tuples keyed by subject), not a triples problem, is also an interesting avenue.


Statistics are gathered ahead-of-time:

For histograms, see also:
http://theory.stanford.edu/~matias/papers/vldb97_tods.pdf

and HyperLogLog as mentioned on the https://datasketches.github.io/docs/ site.

    Andy

On 31/12/2019 13:58, Claude Warren wrote:
Background:
I have been preparing to give a talk on DataSketches at FOSDEM 2020.  I am
doing this for an acquaintance and don't have much background in them but I
am learning.

Idea:
This is a sketch called the Frequent Distinct Tuples (FDT) sketch [1].  What
it can do is estimate the number of occurrences of tuples when the number
of occurrences is "frequent".  Ignoring all the hand-waving for a moment,
note that "frequent" is left undefined in this discussion.

Would it make sense, inside the optimizer to be able to query to find out
which of various values occur most frequently so that the smallest possible
intermediate solutions can be built?  From my reading the FDT sketch can do
this.  So given a set of properties, find out which ones occur most
frequently and use the others first.  Something like that.

DataSketches are in the Apache Incubator.  They look interesting and have
some interesting properties.  I am not sure how applicable they are to us
though.


[1]
https://datasketches.github.io/docs/Frequency/FrequentDistinctTuplesSketch.html

Claude
