Hello,

And here we go:

        
https://github.com/apache/incubator-tinkerpop/blob/master/gremlin-core/src/main/java/org/apache/tinkerpop/gremlin/process/traversal/step/map/MatchStep.java#L557

Filters always go first and then within those categories, sorted by their 
selectivity (i.e. endCounts / startCounts).

Marko.

http://markorodriguez.com

On Jun 23, 2015, at 9:37 AM, Marko Rodriguez <[email protected]> wrote:

> Hi,
> 
> Last night I was reading on the iPad about how databases do query 
> optimization. I have never studied this area of computing so I wanted to get 
> comfortable with the terminology and general techniques. Unfortunately, 
> everything on the topic is a PDF on Elsevier, Springer, etc. And of course, 
> thick and math'd out… Fortunately, I found this neat web page:
> 
>       
> https://bitbucket.org/dotnetrdf/dotnetrdf/wiki/DeveloperGuide/SPARQL/SPARQL%20Optimization
>               
> https://bitbucket.org/dotnetrdf/dotnetrdf/wiki/DeveloperGuide/SPARQL/Implementing%20Custom%20Optimizers
> 
> Here are some things I liked:
> 
>       1. "Evaluate FILTER at the earliest possible point."
>               - we currently do this in CountMatchAlgorithm where all 
> "where()" patterns are sorted to the top.
>                       
> https://github.com/apache/incubator-tinkerpop/blob/master/gremlin-core/src/main/java/org/apache/tinkerpop/gremlin/process/traversal/step/map/MatchStep.java#L546
>       2. "Evaluate the most selective patterns first."
>               - we currently "half do this" as we can only evaluate patterns 
> that have the specified start-label.
>               - we can make this better by preferring those that also have 
> the specified end-label.
>                       
> https://github.com/apache/incubator-tinkerpop/blob/master/gremlin-core/src/main/java/org/apache/tinkerpop/gremlin/process/traversal/step/map/MatchStep.java#L536
>  (required labels are always start-labels)
> 
> On some other webpage somewhere, someone talked about the "selectivity of 
> clauses" as a function of how much data goes in and how much data comes out. 
> We gather this information with MatchAlgorith.recordStart() and 
> MatchAlgorithm.recordEnd(). 
>       
> https://github.com/apache/incubator-tinkerpop/blob/master/gremlin-core/src/main/java/org/apache/tinkerpop/gremlin/process/traversal/step/map/MatchStep.java#L543
>               - we currently only record how much data comes out in 
> CountMatchAlgorithm.
> 
> With both in and out counts, we have a cost function:
> 
>       filter % -- where()-patterns: out / in
>               11 traversers enter, 3 traverser leave -- filter % = 0.27
>       growth % -- match()-patterns: out / in
>               4 traverser enter, 5000 traverser leave -- growth % = 1250.00
> 
> In essence, the %'s are still just based on counts. Thus, CountMatchAlgorithm 
> could have the following struct.
> 
> Record {
>   Traversal<?,?> traversal 
>       // the pattern
>   short dependencyCluster 
>       // some patterns require the output of other patterns (a 
> start-end-label dependency tree is calculated in 
> CountMatchAlgorithm.initialize().)
>   char patternType 
>       // w (where) and m (match)      
>   float cost 
>       // filter and growth %
> }
> 
> Now the global sorting of these records is (in Gremlin notation for ease of 
> readability):
> 
>       order.by(dependencyCluster).thenBy(patternType).thenBy(cost)
> 
> At the moment of execution for a particular traverser, it would have this 
> global ordering. However, it would want to do (2) above and pick a pattern 
> that it has both a start and end for -- thus, it would simply iterate down 
> the list of traversals until it found one it had both start/end-keys for -- 
> if none, then first one it had a start key for.
> 
> Probably should have just implemented it instead of writing an email as its 
> pretty trivial to do.
> 
> Take care,
> Marko.
> 
> http://markorodriguez.com
> 

Reply via email to