Re: Learning from a SPARQL Query Optimizer webpage.

Marko Rodriguez Tue, 23 Jun 2015 13:02:55 -0700

Hi,

Sorry for the SPAM. However, here are some runtimes of the new 
CountMatchAglorithm vs. old CountMatchAlgorithm vs. old MatchStep.


        https://gist.github.com/okram/d49e1abf48fdc18f77f9

Sweet,
Marko.

http://markorodriguez.com

On Jun 23, 2015, at 1:54 PM, Marko Rodriguez <[email protected]> wrote:

> Hello,
> 
> And here we go:
> 
>       
> https://github.com/apache/incubator-tinkerpop/blob/master/gremlin-core/src/main/java/org/apache/tinkerpop/gremlin/process/traversal/step/map/MatchStep.java#L557
> 
> Filters always go first and then within those categories, sorted by their 
> selectivity (i.e. endCounts / startCounts).
> 
> Marko.
> 
> http://markorodriguez.com
> 
> On Jun 23, 2015, at 9:37 AM, Marko Rodriguez <[email protected]> wrote:
> 
>> Hi,
>> 
>> Last night I was reading on the iPad about how databases do query 
>> optimization. I have never studied this area of computing so I wanted to get 
>> comfortable with the terminology and general techniques. Unfortunately, 
>> everything on the topic is a PDF on Elsevier, Springer, etc. And of course, 
>> thick and math'd out… Fortunately, I found this neat web page:
>> 
>>      
>> https://bitbucket.org/dotnetrdf/dotnetrdf/wiki/DeveloperGuide/SPARQL/SPARQL%20Optimization
>>              
>> https://bitbucket.org/dotnetrdf/dotnetrdf/wiki/DeveloperGuide/SPARQL/Implementing%20Custom%20Optimizers
>> 
>> Here are some things I liked:
>> 
>>      1. "Evaluate FILTER at the earliest possible point."
>>              - we currently do this in CountMatchAlgorithm where all 
>> "where()" patterns are sorted to the top.
>>                      
>> https://github.com/apache/incubator-tinkerpop/blob/master/gremlin-core/src/main/java/org/apache/tinkerpop/gremlin/process/traversal/step/map/MatchStep.java#L546
>>      2. "Evaluate the most selective patterns first."
>>              - we currently "half do this" as we can only evaluate patterns 
>> that have the specified start-label.
>>              - we can make this better by preferring those that also have 
>> the specified end-label.
>>                      
>> https://github.com/apache/incubator-tinkerpop/blob/master/gremlin-core/src/main/java/org/apache/tinkerpop/gremlin/process/traversal/step/map/MatchStep.java#L536
>>  (required labels are always start-labels)
>> 
>> On some other webpage somewhere, someone talked about the "selectivity of 
>> clauses" as a function of how much data goes in and how much data comes out. 
>> We gather this information with MatchAlgorith.recordStart() and 
>> MatchAlgorithm.recordEnd(). 
>>      
>> https://github.com/apache/incubator-tinkerpop/blob/master/gremlin-core/src/main/java/org/apache/tinkerpop/gremlin/process/traversal/step/map/MatchStep.java#L543
>>              - we currently only record how much data comes out in 
>> CountMatchAlgorithm.
>> 
>> With both in and out counts, we have a cost function:
>> 
>>      filter % -- where()-patterns: out / in
>>              11 traversers enter, 3 traverser leave -- filter % = 0.27
>>      growth % -- match()-patterns: out / in
>>              4 traverser enter, 5000 traverser leave -- growth % = 1250.00
>> 
>> In essence, the %'s are still just based on counts. Thus, 
>> CountMatchAlgorithm could have the following struct.
>> 
>> Record {
>>   Traversal<?,?> traversal 
>>      // the pattern
>>   short dependencyCluster 
>>      // some patterns require the output of other patterns (a 
>> start-end-label dependency tree is calculated in 
>> CountMatchAlgorithm.initialize().)
>>   char patternType 
>>      // w (where) and m (match)      
>>   float cost 
>>      // filter and growth %
>> }
>> 
>> Now the global sorting of these records is (in Gremlin notation for ease of 
>> readability):
>> 
>>      order.by(dependencyCluster).thenBy(patternType).thenBy(cost)
>> 
>> At the moment of execution for a particular traverser, it would have this 
>> global ordering. However, it would want to do (2) above and pick a pattern 
>> that it has both a start and end for -- thus, it would simply iterate down 
>> the list of traversals until it found one it had both start/end-keys for -- 
>> if none, then first one it had a start key for.
>> 
>> Probably should have just implemented it instead of writing an email as its 
>> pretty trivial to do.
>> 
>> Take care,
>> Marko.
>> 
>> http://markorodriguez.com
>> 
>

Re: Learning from a SPARQL Query Optimizer webpage.

Reply via email to