Hi,
Sorry for the SPAM. However, here are some runtimes of the new
CountMatchAglorithm vs. old CountMatchAlgorithm vs. old MatchStep.
https://gist.github.com/okram/d49e1abf48fdc18f77f9
Sweet,
Marko.
http://markorodriguez.com
On Jun 23, 2015, at 1:54 PM, Marko Rodriguez <[email protected]> wrote:
> Hello,
>
> And here we go:
>
>
> https://github.com/apache/incubator-tinkerpop/blob/master/gremlin-core/src/main/java/org/apache/tinkerpop/gremlin/process/traversal/step/map/MatchStep.java#L557
>
> Filters always go first and then within those categories, sorted by their
> selectivity (i.e. endCounts / startCounts).
>
> Marko.
>
> http://markorodriguez.com
>
> On Jun 23, 2015, at 9:37 AM, Marko Rodriguez <[email protected]> wrote:
>
>> Hi,
>>
>> Last night I was reading on the iPad about how databases do query
>> optimization. I have never studied this area of computing so I wanted to get
>> comfortable with the terminology and general techniques. Unfortunately,
>> everything on the topic is a PDF on Elsevier, Springer, etc. And of course,
>> thick and math'd out… Fortunately, I found this neat web page:
>>
>>
>> https://bitbucket.org/dotnetrdf/dotnetrdf/wiki/DeveloperGuide/SPARQL/SPARQL%20Optimization
>>
>> https://bitbucket.org/dotnetrdf/dotnetrdf/wiki/DeveloperGuide/SPARQL/Implementing%20Custom%20Optimizers
>>
>> Here are some things I liked:
>>
>> 1. "Evaluate FILTER at the earliest possible point."
>> - we currently do this in CountMatchAlgorithm where all
>> "where()" patterns are sorted to the top.
>>
>> https://github.com/apache/incubator-tinkerpop/blob/master/gremlin-core/src/main/java/org/apache/tinkerpop/gremlin/process/traversal/step/map/MatchStep.java#L546
>> 2. "Evaluate the most selective patterns first."
>> - we currently "half do this" as we can only evaluate patterns
>> that have the specified start-label.
>> - we can make this better by preferring those that also have
>> the specified end-label.
>>
>> https://github.com/apache/incubator-tinkerpop/blob/master/gremlin-core/src/main/java/org/apache/tinkerpop/gremlin/process/traversal/step/map/MatchStep.java#L536
>> (required labels are always start-labels)
>>
>> On some other webpage somewhere, someone talked about the "selectivity of
>> clauses" as a function of how much data goes in and how much data comes out.
>> We gather this information with MatchAlgorith.recordStart() and
>> MatchAlgorithm.recordEnd().
>>
>> https://github.com/apache/incubator-tinkerpop/blob/master/gremlin-core/src/main/java/org/apache/tinkerpop/gremlin/process/traversal/step/map/MatchStep.java#L543
>> - we currently only record how much data comes out in
>> CountMatchAlgorithm.
>>
>> With both in and out counts, we have a cost function:
>>
>> filter % -- where()-patterns: out / in
>> 11 traversers enter, 3 traverser leave -- filter % = 0.27
>> growth % -- match()-patterns: out / in
>> 4 traverser enter, 5000 traverser leave -- growth % = 1250.00
>>
>> In essence, the %'s are still just based on counts. Thus,
>> CountMatchAlgorithm could have the following struct.
>>
>> Record {
>> Traversal<?,?> traversal
>> // the pattern
>> short dependencyCluster
>> // some patterns require the output of other patterns (a
>> start-end-label dependency tree is calculated in
>> CountMatchAlgorithm.initialize().)
>> char patternType
>> // w (where) and m (match)
>> float cost
>> // filter and growth %
>> }
>>
>> Now the global sorting of these records is (in Gremlin notation for ease of
>> readability):
>>
>> order.by(dependencyCluster).thenBy(patternType).thenBy(cost)
>>
>> At the moment of execution for a particular traverser, it would have this
>> global ordering. However, it would want to do (2) above and pick a pattern
>> that it has both a start and end for -- thus, it would simply iterate down
>> the list of traversals until it found one it had both start/end-keys for --
>> if none, then first one it had a start key for.
>>
>> Probably should have just implemented it instead of writing an email as its
>> pretty trivial to do.
>>
>> Take care,
>> Marko.
>>
>> http://markorodriguez.com
>>
>