Hello,
And here we go:
https://github.com/apache/incubator-tinkerpop/blob/master/gremlin-core/src/main/java/org/apache/tinkerpop/gremlin/process/traversal/step/map/MatchStep.java#L557
Filters always go first and then within those categories, sorted by their
selectivity (i.e. endCounts / startCounts).
Marko.
http://markorodriguez.com
On Jun 23, 2015, at 9:37 AM, Marko Rodriguez <[email protected]> wrote:
> Hi,
>
> Last night I was reading on the iPad about how databases do query
> optimization. I have never studied this area of computing so I wanted to get
> comfortable with the terminology and general techniques. Unfortunately,
> everything on the topic is a PDF on Elsevier, Springer, etc. And of course,
> thick and math'd out… Fortunately, I found this neat web page:
>
>
> https://bitbucket.org/dotnetrdf/dotnetrdf/wiki/DeveloperGuide/SPARQL/SPARQL%20Optimization
>
> https://bitbucket.org/dotnetrdf/dotnetrdf/wiki/DeveloperGuide/SPARQL/Implementing%20Custom%20Optimizers
>
> Here are some things I liked:
>
> 1. "Evaluate FILTER at the earliest possible point."
> - we currently do this in CountMatchAlgorithm where all
> "where()" patterns are sorted to the top.
>
> https://github.com/apache/incubator-tinkerpop/blob/master/gremlin-core/src/main/java/org/apache/tinkerpop/gremlin/process/traversal/step/map/MatchStep.java#L546
> 2. "Evaluate the most selective patterns first."
> - we currently "half do this" as we can only evaluate patterns
> that have the specified start-label.
> - we can make this better by preferring those that also have
> the specified end-label.
>
> https://github.com/apache/incubator-tinkerpop/blob/master/gremlin-core/src/main/java/org/apache/tinkerpop/gremlin/process/traversal/step/map/MatchStep.java#L536
> (required labels are always start-labels)
>
> On some other webpage somewhere, someone talked about the "selectivity of
> clauses" as a function of how much data goes in and how much data comes out.
> We gather this information with MatchAlgorith.recordStart() and
> MatchAlgorithm.recordEnd().
>
> https://github.com/apache/incubator-tinkerpop/blob/master/gremlin-core/src/main/java/org/apache/tinkerpop/gremlin/process/traversal/step/map/MatchStep.java#L543
> - we currently only record how much data comes out in
> CountMatchAlgorithm.
>
> With both in and out counts, we have a cost function:
>
> filter % -- where()-patterns: out / in
> 11 traversers enter, 3 traverser leave -- filter % = 0.27
> growth % -- match()-patterns: out / in
> 4 traverser enter, 5000 traverser leave -- growth % = 1250.00
>
> In essence, the %'s are still just based on counts. Thus, CountMatchAlgorithm
> could have the following struct.
>
> Record {
> Traversal<?,?> traversal
> // the pattern
> short dependencyCluster
> // some patterns require the output of other patterns (a
> start-end-label dependency tree is calculated in
> CountMatchAlgorithm.initialize().)
> char patternType
> // w (where) and m (match)
> float cost
> // filter and growth %
> }
>
> Now the global sorting of these records is (in Gremlin notation for ease of
> readability):
>
> order.by(dependencyCluster).thenBy(patternType).thenBy(cost)
>
> At the moment of execution for a particular traverser, it would have this
> global ordering. However, it would want to do (2) above and pick a pattern
> that it has both a start and end for -- thus, it would simply iterate down
> the list of traversals until it found one it had both start/end-keys for --
> if none, then first one it had a start key for.
>
> Probably should have just implemented it instead of writing an email as its
> pretty trivial to do.
>
> Take care,
> Marko.
>
> http://markorodriguez.com
>