Hi,

Last night I was reading on the iPad about how databases do query optimization. 
I have never studied this area of computing so I wanted to get comfortable with 
the terminology and general techniques. Unfortunately, everything on the topic 
is a PDF on Elsevier, Springer, etc. And of course, thick and math'd out… 
Fortunately, I found these neat web pages:

        https://bitbucket.org/dotnetrdf/dotnetrdf/wiki/DeveloperGuide/SPARQL/SPARQL%20Optimization
        https://bitbucket.org/dotnetrdf/dotnetrdf/wiki/DeveloperGuide/SPARQL/Implementing%20Custom%20Optimizers

Here are some things I liked:

        1. "Evaluate FILTER at the earliest possible point."
                - we currently do this in CountMatchAlgorithm where all "where()" patterns are sorted to the top.
                        https://github.com/apache/incubator-tinkerpop/blob/master/gremlin-core/src/main/java/org/apache/tinkerpop/gremlin/process/traversal/step/map/MatchStep.java#L546
        2. "Evaluate the most selective patterns first."
                - we currently "half do this" as we can only evaluate patterns that have the specified start-label.
                - we can make this better by preferring those that also have the specified end-label.
                        https://github.com/apache/incubator-tinkerpop/blob/master/gremlin-core/src/main/java/org/apache/tinkerpop/gremlin/process/traversal/step/map/MatchStep.java#L536
                        (required labels are always start-labels)

On some other webpage somewhere, someone talked about the "selectivity of clauses" as a function of how much data goes in and how much data comes out. We gather this information with MatchAlgorithm.recordStart() and MatchAlgorithm.recordEnd().
        https://github.com/apache/incubator-tinkerpop/blob/master/gremlin-core/src/main/java/org/apache/tinkerpop/gremlin/process/traversal/step/map/MatchStep.java#L543
                - we currently only record how much data comes out in CountMatchAlgorithm.

With both in- and out-counts, we have a cost function:

        filter % -- where()-patterns: out / in
                11 traversers enter, 3 traversers leave -- filter % = 0.27
        growth % -- match()-patterns: out / in
                4 traversers enter, 5000 traversers leave -- growth % = 1250.00
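Note that both percentages reduce to the same out/in ratio. A minimal sketch (CostSketch is my own name, not anything in CountMatchAlgorithm):

```java
// Hypothetical sketch of the cost function above.
public class CostSketch {

    // out / in: a value below 1.0 means a filtering where()-pattern,
    // a value above 1.0 means a growing match()-pattern
    public static double cost(final long in, final long out) {
        return (double) out / in;
    }

    public static void main(final String[] args) {
        System.out.println(cost(11, 3));    // filter % ~= 0.27
        System.out.println(cost(4, 5000));  // growth % = 1250.0
    }
}
```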

In essence, the %'s are still just based on counts. Thus, CountMatchAlgorithm 
could have the following struct.

Record {
  Traversal<?,?> traversal
        // the pattern
  short dependencyCluster
        // some patterns require the output of other patterns (a
        // start-/end-label dependency tree is calculated in
        // CountMatchAlgorithm.initialize())
  char patternType
        // 'w' (where) or 'm' (match)
  float cost
        // filter or growth %
}

Now the global sorting of these records is (in Gremlin notation for ease of 
readability):

        order.by(dependencyCluster).thenBy(patternType).thenBy(cost)
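In Java Comparator form, that three-key sort might look like the sketch below. Record here mirrors the hypothetical struct above, and I'm assuming where()-patterns should sort before match()-patterns per (1) -- natural char order would put 'm' first, hence the explicit key:

```java
import java.util.Comparator;

// Sketch only: Record mirrors the hypothetical struct above.
class Record {
    short dependencyCluster;
    char patternType; // 'w' (where) or 'm' (match)
    float cost;

    Record(final short dependencyCluster, final char patternType, final float cost) {
        this.dependencyCluster = dependencyCluster;
        this.patternType = patternType;
        this.cost = cost;
    }
}

class GlobalOrder {
    // order.by(dependencyCluster).thenBy(patternType).thenBy(cost)
    static final Comparator<Record> COMPARATOR =
        Comparator.<Record>comparingInt(r -> r.dependencyCluster)
                  .thenComparingInt(r -> r.patternType == 'w' ? 0 : 1) // where() first
                  .thenComparingDouble(r -> r.cost);                   // cheapest first
}
```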

At the moment of execution, a particular traverser would have this global ordering. However, per (2) above, it would want to pick a pattern that it has both a start- and end-label for -- thus, it would simply iterate down the list of traversals until it found one it had both start- and end-keys for; if none, then the first one it had a start-key for.
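That two-pass selection loop might look like this sketch (all names are hypothetical; "bound" stands for the labels the traverser currently carries):

```java
import java.util.List;
import java.util.Set;

// Hypothetical sketch of the per-traverser pattern selection described above.
class Pattern {
    final String startLabel;
    final String endLabel; // may be null if the pattern has no end-label

    Pattern(final String startLabel, final String endLabel) {
        this.startLabel = startLabel;
        this.endLabel = endLabel;
    }
}

class PatternPicker {
    // ordered: the globally sorted patterns; bound: labels the traverser has
    static Pattern pick(final List<Pattern> ordered, final Set<String> bound) {
        // first pass: prefer a pattern with both start- and end-label bound
        for (final Pattern p : ordered)
            if (bound.contains(p.startLabel)
                && p.endLabel != null && bound.contains(p.endLabel))
                return p;
        // fallback: first pattern with a bound start-label
        for (final Pattern p : ordered)
            if (bound.contains(p.startLabel))
                return p;
        return null; // no runnable pattern for this traverser
    }
}
```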

I probably should have just implemented it instead of writing an email, as it's pretty trivial to do.

Take care,
Marko.

http://markorodriguez.com
