Hey Rob,

I didn't know you wrote that page. Thanks for doing so.

You don't know how hard it is to find good practical information on database 
query optimizers. Yesterday I was like "let me just go to Amazon and buy a 
'theory and practice of query optimization'-style book." Nothing. Bunch of 
weird "Optimize your MS Access Queries in 5 Minutes!"-type books and then a 
bunch of "A relational tuple projection over the space of reals is a manifold 
isomorphic to the tensor representation of that same tuple within the 
Levinsky-Portrokov metric space"-style books.

If you know of any solid books that are not too "Oracle/DB2 does it like this…" 
and are instead general to any query language and data structure, that would 
be a nice tidbit to share.

Thanks again,
Marko.

http://markorodriguez.com

On Jun 24, 2015, at 1:05 AM, Rob Vesse <[email protected]> wrote:

> Nice
> 
> Glad to see I occasionally write some documentation that people find useful
> 
> Rob
> 
> On 23/06/2015 21:01, "Marko Rodriguez" <[email protected]> wrote:
> 
>> Hi,
>> 
>> Sorry for the SPAM. However, here are some runtimes of the new
>> CountMatchAlgorithm vs. old CountMatchAlgorithm vs. old MatchStep.
>> 
>>      https://gist.github.com/okram/d49e1abf48fdc18f77f9
>> 
>> Sweet,
>> Marko.
>> 
>> http://markorodriguez.com
>> 
>> On Jun 23, 2015, at 1:54 PM, Marko Rodriguez <[email protected]> wrote:
>> 
>>> Hello,
>>> 
>>> And here we go:
>>> 
>>> 
>>>     https://github.com/apache/incubator-tinkerpop/blob/master/gremlin-core/src/main/java/org/apache/tinkerpop/gremlin/process/traversal/step/map/MatchStep.java#L557
>>> 
>>> Filters always go first and then within those categories, sorted by
>>> their selectivity (i.e. endCounts / startCounts).
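A minimal sketch of that two-level ordering, assuming a simple "filters first, then ascending selectivity" sort (the class and field names here are illustrative stand-ins, not the actual MatchStep internals):

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

public class PatternSort {

    // Illustrative stand-in for a match()/where() pattern with its observed counts.
    static class Pattern {
        final String name;
        final boolean isFilter;  // true for where()-patterns
        final long startCount;   // traversers that entered the pattern
        final long endCount;     // traversers that left the pattern

        Pattern(String name, boolean isFilter, long startCount, long endCount) {
            this.name = name;
            this.isFilter = isFilter;
            this.startCount = startCount;
            this.endCount = endCount;
        }

        // selectivity = endCount / startCount (lower is better)
        double selectivity() {
            return startCount == 0 ? Double.MAX_VALUE : (double) endCount / startCount;
        }
    }

    // Filters first, then ascending selectivity within each group.
    static List<String> sortedNames(List<Pattern> patterns) {
        return patterns.stream()
                .sorted(Comparator.comparing((Pattern p) -> !p.isFilter)
                        .thenComparingDouble(Pattern::selectivity))
                .map(p -> p.name)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Pattern> patterns = Arrays.asList(
                new Pattern("a.out('knows').as('b')", false, 4, 5000),
                new Pattern("where('a', neq('b'))", true, 11, 3),
                new Pattern("b.out('created').as('c')", false, 10, 20));
        // the where()-pattern sorts first, then the cheaper match()-pattern
        System.out.println(sortedNames(patterns));
    }
}
```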
>>> 
>>> Marko.
>>> 
>>> http://markorodriguez.com
>>> 
>>> On Jun 23, 2015, at 9:37 AM, Marko Rodriguez <[email protected]>
>>> wrote:
>>> 
>>>> Hi,
>>>> 
>>>> Last night I was reading on the iPad about how databases do query
>>>> optimization. I have never studied this area of computing so I wanted
>>>> to get comfortable with the terminology and general techniques.
>>>> Unfortunately, everything on the topic is a PDF on Elsevier, Springer,
>>>> etc. And of course, thick and math'd out… Fortunately, I found this
>>>> neat web page:
>>>> 
>>>> 
>>>>    https://bitbucket.org/dotnetrdf/dotnetrdf/wiki/DeveloperGuide/SPARQL/SPARQL%20Optimization
>>>> 
>>>>    https://bitbucket.org/dotnetrdf/dotnetrdf/wiki/DeveloperGuide/SPARQL/Implementing%20Custom%20Optimizers
>>>> 
>>>> Here are some things I liked:
>>>> 
>>>>    1. "Evaluate FILTER at the earliest possible point."
>>>>            - we currently do this in CountMatchAlgorithm, where all
>>>> "where()" patterns are sorted to the top.
>>>> 
>>>>                    https://github.com/apache/incubator-tinkerpop/blob/master/gremlin-core/src/main/java/org/apache/tinkerpop/gremlin/process/traversal/step/map/MatchStep.java#L546
>>>>    2. "Evaluate the most selective patterns first."
>>>>            - we currently "half do this," as we can only evaluate patterns
>>>> that have the specified start-label.
>>>>            - we can make this better by preferring those that also have the
>>>> specified end-label.
>>>> 
>>>>                    https://github.com/apache/incubator-tinkerpop/blob/master/gremlin-core/src/main/java/org/apache/tinkerpop/gremlin/process/traversal/step/map/MatchStep.java#L536 (required labels are always start-labels)
>>>> 
>>>> On some other webpage somewhere, someone talked about the "selectivity
>>>> of clauses" as a function of how much data goes in and how much data
>>>> comes out. We gather this information with MatchAlgorithm.recordStart()
>>>> and MatchAlgorithm.recordEnd().
>>>> 
>>>>    https://github.com/apache/incubator-tinkerpop/blob/master/gremlin-core/src/main/java/org/apache/tinkerpop/gremlin/process/traversal/step/map/MatchStep.java#L543
>>>>            - we currently only record how much data comes out in
>>>> CountMatchAlgorithm.
>>>> 
>>>> With both in and out counts, we have a cost function:
>>>> 
>>>>    filter % -- where()-patterns: out / in
>>>>            11 traversers enter, 3 traversers leave -- filter % = 0.27
>>>>    growth % -- match()-patterns: out / in
>>>>            4 traversers enter, 5000 traversers leave -- growth % = 1250.00
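Both percentages are the same out/in ratio; a quick sanity check of the numbers above (the `cost` helper is hypothetical, not TinkerPop code):

```java
public class CostDemo {
    // cost = out / in, whether the pattern shrinks (filter %) or grows (growth %)
    static double cost(long in, long out) {
        return (double) out / in;
    }

    public static void main(String[] args) {
        System.out.printf("filter %% = %.2f%n", cost(11, 3));    // 3 / 11
        System.out.printf("growth %% = %.2f%n", cost(4, 5000));  // 5000 / 4
    }
}
```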
>>>> 
>>>> In essence, the %'s are still just based on counts. Thus,
>>>> CountMatchAlgorithm could have the following struct.
>>>> 
>>>> Record {
>>>>  Traversal<?,?> traversal
>>>>    // the pattern
>>>>  short dependencyCluster
>>>>    // some patterns require the output of other patterns (a
>>>> start-end-label dependency tree is calculated in
>>>> CountMatchAlgorithm.initialize().)
>>>>  char patternType
>>>>    // w (where) and m (match)      
>>>>  float cost 
>>>>    // filter and growth %
>>>> }
>>>> 
>>>> Now the global sorting of these records is (in Gremlin notation for
>>>> ease of readability):
>>>> 
>>>>    order.by(dependencyCluster).thenBy(patternType).thenBy(cost)
>>>> 
>>>> At the moment of execution, a particular traverser would have this
>>>> global ordering. However, per (2) above, it would prefer a pattern that
>>>> it has both a start- and end-label for -- thus, it would simply iterate
>>>> down the list of traversals until it found one it had both
>>>> start/end-keys for; if none, then the first one it had a start-key for.
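Sketching the whole scheme end-to-end: the `Record` fields come from the struct above, while the comparator and selection loop are a hedged illustration, not the actual implementation (note that plain `char` ordering would put 'm' before 'w', so the comparator explicitly sorts where()-patterns first):

```java
import java.util.Arrays;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Set;

public class RecordOrdering {

    // Illustrative stand-in for the Record struct, with start/end labels added
    // so the selection step can be demonstrated.
    static class Record {
        final String traversal;        // the pattern (a label here, for the sketch)
        final short dependencyCluster; // start/end-label dependency cluster
        final char patternType;        // 'w' (where) or 'm' (match)
        final float cost;              // filter/growth %
        final String startLabel, endLabel;

        Record(String traversal, int cluster, char type, float cost,
               String startLabel, String endLabel) {
            this.traversal = traversal;
            this.dependencyCluster = (short) cluster;
            this.patternType = type;
            this.cost = cost;
            this.startLabel = startLabel;
            this.endLabel = endLabel;
        }
    }

    // order.by(dependencyCluster).thenBy(patternType).thenBy(cost),
    // with patternType compared so 'w' (filters) sorts before 'm'.
    static final Comparator<Record> GLOBAL_ORDER =
            Comparator.comparingInt((Record r) -> r.dependencyCluster)
                    .thenComparing(r -> r.patternType != 'w') // where() first
                    .thenComparingDouble(r -> r.cost);

    // Prefer the first pattern the traverser has both start- and end-labels for;
    // otherwise fall back to the first pattern it has a start-label for.
    static Record select(List<Record> sorted, Set<String> traverserKeys) {
        Record fallback = null;
        for (Record r : sorted) {
            if (traverserKeys.contains(r.startLabel)) {
                if (traverserKeys.contains(r.endLabel)) return r;
                if (fallback == null) fallback = r;
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        List<Record> records = new ArrayList<>(Arrays.asList(
                new Record("a-knows->b", 0, 'm', 2.0f, "a", "b"),
                new Record("where(b neq a)", 0, 'w', 0.27f, "b", "b"),
                new Record("a-created->c", 0, 'm', 1250.0f, "a", "c")));
        records.sort(GLOBAL_ORDER);
        // a traverser that only has "a" so far gets the cheapest start-label match
        System.out.println(select(records, Set.of("a")).traversal);       // a-knows->b
        // once it also has "b", the where()-pattern (both labels known) wins
        System.out.println(select(records, Set.of("a", "b")).traversal);  // where(b neq a)
    }
}
```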
>>>> 
>>>> I probably should have just implemented it instead of writing an email,
>>>> as it's pretty trivial to do.
>>>> 
>>>> Take care,
>>>> Marko.
>>>> 
>>>> http://markorodriguez.com
>>>> 
>>> 
>> 
> 
