Nice

Glad to see I occasionally write some documentation that people find useful

Rob

On 23/06/2015 21:01, "Marko Rodriguez" <[email protected]> wrote:

>Hi,
>
>Sorry for the SPAM. However, here are some runtimes of the new
>CountMatchAglorithm vs. old CountMatchAlgorithm vs. old MatchStep.
>
>       https://gist.github.com/okram/d49e1abf48fdc18f77f9
>
>Sweet,
>Marko.
>
>http://markorodriguez.com
>
>On Jun 23, 2015, at 1:54 PM, Marko Rodriguez <[email protected]> wrote:
>
>> Hello,
>> 
>> And here we go:
>> 
>> 
>>      https://github.com/apache/incubator-tinkerpop/blob/master/gremlin-core/s
>>rc/main/java/org/apache/tinkerpop/gremlin/process/traversal/step/map/Matc
>>hStep.java#L557
>> 
>> Filters always go first and then within those categories, sorted by
>>their selectivity (i.e. endCounts / startCounts).
>> 
>> Marko.
>> 
>> http://markorodriguez.com
>> 
>> On Jun 23, 2015, at 9:37 AM, Marko Rodriguez <[email protected]>
>>wrote:
>> 
>>> Hi,
>>> 
>>> Last night I was reading on the iPad about how databases do query
>>>optimization. I have never studied this area of computing so I wanted
>>>to get comfortable with the terminology and general techniques.
>>>Unfortunately, everything on the topic is a PDF on Elsevier, Springer,
>>>etc. And of course, thick and math'd out… Fortunately, I found this
>>>neat web page:
>>> 
>>> 
>>>     https://bitbucket.org/dotnetrdf/dotnetrdf/wiki/DeveloperGuide/SPARQL/SP
>>>ARQL%20Optimization
>>> 
>>>             
>>> https://bitbucket.org/dotnetrdf/dotnetrdf/wiki/DeveloperGuide/SPARQL/I
>>>mplementing%20Custom%20Optimizers
>>> 
>>> Here are some things I liked:
>>> 
>>>     1. "Evaluate FILTER at the earliest possible point."
>>>             - we currently do this in CountMatchAlgorithm where all 
>>> "where()"
>>>patterns are sorted to the top.
>>> 
>>>                     
>>> https://github.com/apache/incubator-tinkerpop/blob/master/gremlin-cor
>>>e/src/main/java/org/apache/tinkerpop/gremlin/process/traversal/step/map/
>>>MatchStep.java#L546
>>>     2. "Evaluate the most selective patterns first."
>>>             - we currently "half do this" as we can only evaluate patterns 
>>> that
>>>have the specified start-label.
>>>             - we can make this better by preferring those that also have the
>>>specified end-label.
>>> 
>>>                     
>>> https://github.com/apache/incubator-tinkerpop/blob/master/gremlin-cor
>>>e/src/main/java/org/apache/tinkerpop/gremlin/process/traversal/step/map/
>>>MatchStep.java#L536 (required labels are always start-labels)
>>> 
>>> On some other webpage somewhere, someone talked about the "selectivity
>>>of clauses" as a function of how much data goes in and how much data
>>>comes out. We gather this information with MatchAlgorith.recordStart()
>>>and MatchAlgorithm.recordEnd().
>>> 
>>>     https://github.com/apache/incubator-tinkerpop/blob/master/gremlin-core/
>>>src/main/java/org/apache/tinkerpop/gremlin/process/traversal/step/map/Ma
>>>tchStep.java#L543
>>>             - we currently only record how much data comes out in
>>>CountMatchAlgorithm.
>>> 
>>> With both in and out counts, we have a cost function:
>>> 
>>>     filter % -- where()-patterns: out / in
>>>             11 traversers enter, 3 traverser leave -- filter % = 0.27
>>>     growth % -- match()-patterns: out / in
>>>             4 traverser enter, 5000 traverser leave -- growth % = 1250.00
>>> 
>>> In essence, the %'s are still just based on counts. Thus,
>>>CountMatchAlgorithm could have the following struct.
>>> 
>>> Record {
>>>   Traversal<?,?> traversal
>>>     // the pattern
>>>   short dependencyCluster
>>>     // some patterns require the output of other patterns (a
>>>start-end-label dependency tree is calculated in
>>>CountMatchAlgorithm.initialize().)
>>>   char patternType
>>>     // w (where) and m (match)
>>>   float cost 
>>>     // filter and growth %
>>> }
>>> 
>>> Now the global sorting of these records is (in Gremlin notation for
>>>ease of readability):
>>> 
>>>     order.by(dependencyCluster).thenBy(patternType).thenBy(cost)
>>> 
>>> At the moment of execution for a particular traverser, it would have
>>>this global ordering. However, it would want to do (2) above and pick a
>>>pattern that it has both a start and end for -- thus, it would simply
>>>iterate down the list of traversals until it found one it had both
>>>start/end-keys for -- if none, then first one it had a start key for.
>>> 
>>> Probably should have just implemented it instead of writing an email
>>>as its pretty trivial to do.
>>> 
>>> Take care,
>>> Marko.
>>> 
>>> http://markorodriguez.com
>>> 
>> 
>




Reply via email to