Hi,
Last night I was reading on the iPad about how databases do query optimization.
I have never studied this area of computing so I wanted to get comfortable with
the terminology and general techniques. Unfortunately, everything on the topic
is a PDF on Elsevier, Springer, etc. -- and, of course, thick and math'd out…
Fortunately, I found this neat web page:
https://bitbucket.org/dotnetrdf/dotnetrdf/wiki/DeveloperGuide/SPARQL/SPARQL%20Optimization
https://bitbucket.org/dotnetrdf/dotnetrdf/wiki/DeveloperGuide/SPARQL/Implementing%20Custom%20Optimizers
Here are some things I liked:
1. "Evaluate FILTER at the earliest possible point."
- we currently do this in CountMatchAlgorithm where all
"where()" patterns are sorted to the top.
https://github.com/apache/incubator-tinkerpop/blob/master/gremlin-core/src/main/java/org/apache/tinkerpop/gremlin/process/traversal/step/map/MatchStep.java#L546
2. "Evaluate the most selective patterns first."
- we currently "half do this" as we can only evaluate patterns
that have the specified start-label.
- we can make this better by preferring those that also have
the specified end-label.
https://github.com/apache/incubator-tinkerpop/blob/master/gremlin-core/src/main/java/org/apache/tinkerpop/gremlin/process/traversal/step/map/MatchStep.java#L536
(required labels are always start-labels)
On some other webpage somewhere, someone talked about the "selectivity of
clauses" as a function of how much data goes in and how much data comes out. We
gather this information with MatchAlgorithm.recordStart() and
MatchAlgorithm.recordEnd().
https://github.com/apache/incubator-tinkerpop/blob/master/gremlin-core/src/main/java/org/apache/tinkerpop/gremlin/process/traversal/step/map/MatchStep.java#L543
- we currently only record how much data comes out in
CountMatchAlgorithm.
With both in- and out-counts, we have a cost function (out / in is a ratio,
not a literal percentage):
filter ratio -- where()-patterns: out / in
11 traversers enter, 3 traversers leave -- filter ratio = 0.27
growth ratio -- match()-patterns: out / in
4 traversers enter, 5000 traversers leave -- growth ratio = 1250.00
In essence, the ratios are still just based on counts. Thus, CountMatchAlgorithm
could have the following struct.
Record {
    Traversal<?,?> traversal   // the pattern
    short dependencyCluster    // some patterns require the output of other
                               // patterns (a start/end-label dependency tree
                               // is calculated in
                               // CountMatchAlgorithm.initialize())
    char patternType           // 'w' (where) or 'm' (match)
    float cost                 // filter or growth ratio
}
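As a runnable sketch, the record plus the in/out counters could look like the
following (all names here are illustrative assumptions, not the actual
CountMatchAlgorithm internals):

```java
// Illustrative sketch only -- not the actual CountMatchAlgorithm code.
public class PatternRecord {
    final String patternId;        // stand-in for Traversal<?,?> traversal
    final short dependencyCluster; // patterns sharing a start/end-label dependency
    final char patternType;        // 'w' (where) or 'm' (match)
    long inCount = 0;              // traversers that entered (cf. recordStart())
    long outCount = 0;             // traversers that left (cf. recordEnd())

    PatternRecord(String patternId, short dependencyCluster, char patternType) {
        this.patternId = patternId;
        this.dependencyCluster = dependencyCluster;
        this.patternType = patternType;
    }

    void recordStart() { this.inCount++; } // a traverser enters the pattern
    void recordEnd() { this.outCount++; }  // a traverser leaves the pattern

    // cost = out / in: a filter ratio for where()-patterns (<= 1.0),
    // a growth ratio for match()-patterns (possibly >> 1.0).
    float cost() {
        return this.inCount == 0 ? 0.0f : (float) this.outCount / this.inCount;
    }

    public static void main(String[] args) {
        PatternRecord where = new PatternRecord("where-pattern", (short) 0, 'w');
        for (int i = 0; i < 11; i++) where.recordStart(); // 11 traversers enter
        for (int i = 0; i < 3; i++) where.recordEnd();    // 3 traversers leave
        System.out.printf("filter ratio = %.2f%n", where.cost()); // prints 0.27
    }
}
```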
Now the global sorting of these records is (in Gremlin notation for ease of
readability):
order.by(dependencyCluster).thenBy(patternType).thenBy(cost)
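In plain Java the same ordering could be a Comparator -- a sketch with assumed
field names, and assuming (per (1) above) that where()-patterns should sort
before match()-patterns:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Sketch of order.by(dependencyCluster).thenBy(patternType).thenBy(cost).
public class RecordOrder {
    static class Rec {
        final short dependencyCluster;
        final char patternType; // 'w' (where) or 'm' (match)
        final float cost;
        Rec(short d, char t, float c) {
            dependencyCluster = d; patternType = t; cost = c;
        }
    }

    // Natural char order would put 'm' before 'w', so we map 'w' -> 0
    // explicitly to keep where()-patterns (filters) at the top, per (1).
    static final Comparator<Rec> GLOBAL_ORDER =
        Comparator.<Rec>comparingInt(r -> r.dependencyCluster)
                  .thenComparingInt(r -> r.patternType == 'w' ? 0 : 1)
                  .thenComparingDouble(r -> r.cost);

    public static void main(String[] args) {
        List<Rec> recs = new ArrayList<>();
        recs.add(new Rec((short) 0, 'm', 1250.0f));
        recs.add(new Rec((short) 0, 'w', 0.27f));
        recs.add(new Rec((short) 0, 'w', 0.10f));
        recs.sort(GLOBAL_ORDER);
        // cheapest where()-pattern first, the match()-pattern last
        System.out.println(recs.get(0).patternType + " " + recs.get(0).cost);
    }
}
```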
At the moment of execution for a particular traverser, it would have this
global ordering. However, per (2) above, it would want to pick a pattern that
it has both a start- and end-label for -- thus, it would simply iterate down
the list of traversals until it found one it had both start- and end-keys for;
if none, then the first one it had a start-key for.
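That per-traverser selection can be sketched as follows (the label-set and
pattern shapes are assumptions for illustration, not the actual MatchStep API):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Walk the globally sorted pattern list: prefer the first pattern whose start-
// AND end-labels the traverser already has; otherwise fall back to the first
// pattern it has a start-label for.
public class PatternSelect {
    static class Pattern {
        final String startLabel;
        final String endLabel; // null if the pattern declares no end-label
        Pattern(String startLabel, String endLabel) {
            this.startLabel = startLabel;
            this.endLabel = endLabel;
        }
    }

    static Pattern select(List<Pattern> sorted, Set<String> traverserLabels) {
        Pattern fallback = null;
        for (Pattern p : sorted) {
            if (!traverserLabels.contains(p.startLabel))
                continue; // can't evaluate a pattern without its start-label
            if (p.endLabel != null && traverserLabels.contains(p.endLabel))
                return p; // both start- and end-label bound: most selective
            if (fallback == null)
                fallback = p; // remember the first start-label-only match
        }
        return fallback; // null if no pattern is evaluable yet
    }

    public static void main(String[] args) {
        List<Pattern> sorted = Arrays.asList(
            new Pattern("a", "b"),  // needs "b", which the traverser lacks
            new Pattern("a", "c")); // both "a" and "c" are bound -> chosen
        Set<String> labels = new HashSet<>(Arrays.asList("a", "c"));
        System.out.println(select(sorted, labels).endLabel); // prints "c"
    }
}
```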
I probably should have just implemented this instead of writing an email, as
it's pretty trivial to do.
Take care,
Marko.
http://markorodriguez.com