Nice Glad to see I occasionally write some documentation that people find useful
Rob On 23/06/2015 21:01, "Marko Rodriguez" <[email protected]> wrote: >Hi, > >Sorry for the SPAM. However, here are some runtimes of the new >CountMatchAglorithm vs. old CountMatchAlgorithm vs. old MatchStep. > > https://gist.github.com/okram/d49e1abf48fdc18f77f9 > >Sweet, >Marko. > >http://markorodriguez.com > >On Jun 23, 2015, at 1:54 PM, Marko Rodriguez <[email protected]> wrote: > >> Hello, >> >> And here we go: >> >> >> https://github.com/apache/incubator-tinkerpop/blob/master/gremlin-core/s >>rc/main/java/org/apache/tinkerpop/gremlin/process/traversal/step/map/Matc >>hStep.java#L557 >> >> Filters always go first and then within those categories, sorted by >>their selectivity (i.e. endCounts / startCounts). >> >> Marko. >> >> http://markorodriguez.com >> >> On Jun 23, 2015, at 9:37 AM, Marko Rodriguez <[email protected]> >>wrote: >> >>> Hi, >>> >>> Last night I was reading on the iPad about how databases do query >>>optimization. I have never studied this area of computing so I wanted >>>to get comfortable with the terminology and general techniques. >>>Unfortunately, everything on the topic is a PDF on Elsevier, Springer, >>>etc. And of course, thick and math'd out… Fortunately, I found this >>>neat web page: >>> >>> >>> https://bitbucket.org/dotnetrdf/dotnetrdf/wiki/DeveloperGuide/SPARQL/SP >>>ARQL%20Optimization >>> >>> >>> https://bitbucket.org/dotnetrdf/dotnetrdf/wiki/DeveloperGuide/SPARQL/I >>>mplementing%20Custom%20Optimizers >>> >>> Here are some things I liked: >>> >>> 1. "Evaluate FILTER at the earliest possible point." >>> - we currently do this in CountMatchAlgorithm where all >>> "where()" >>>patterns are sorted to the top. >>> >>> >>> https://github.com/apache/incubator-tinkerpop/blob/master/gremlin-cor >>>e/src/main/java/org/apache/tinkerpop/gremlin/process/traversal/step/map/ >>>MatchStep.java#L546 >>> 2. "Evaluate the most selective patterns first." >>> - we currently "half do this" as we can only evaluate patterns >>> that >>>have the specified start-label. >>> - we can make this better by preferring those that also have the >>>specified end-label. >>> >>> >>> https://github.com/apache/incubator-tinkerpop/blob/master/gremlin-cor >>>e/src/main/java/org/apache/tinkerpop/gremlin/process/traversal/step/map/ >>>MatchStep.java#L536 (required labels are always start-labels) >>> >>> On some other webpage somewhere, someone talked about the "selectivity >>>of clauses" as a function of how much data goes in and how much data >>>comes out. We gather this information with MatchAlgorith.recordStart() >>>and MatchAlgorithm.recordEnd(). >>> >>> https://github.com/apache/incubator-tinkerpop/blob/master/gremlin-core/ >>>src/main/java/org/apache/tinkerpop/gremlin/process/traversal/step/map/Ma >>>tchStep.java#L543 >>> - we currently only record how much data comes out in >>>CountMatchAlgorithm. >>> >>> With both in and out counts, we have a cost function: >>> >>> filter % -- where()-patterns: out / in >>> 11 traversers enter, 3 traverser leave -- filter % = 0.27 >>> growth % -- match()-patterns: out / in >>> 4 traverser enter, 5000 traverser leave -- growth % = 1250.00 >>> >>> In essence, the %'s are still just based on counts. Thus, >>>CountMatchAlgorithm could have the following struct. >>> >>> Record { >>> Traversal<?,?> traversal >>> // the pattern >>> short dependencyCluster >>> // some patterns require the output of other patterns (a >>>start-end-label dependency tree is calculated in >>>CountMatchAlgorithm.initialize().) >>> char patternType >>> // w (where) and m (match) >>> float cost >>> // filter and growth % >>> } >>> >>> Now the global sorting of these records is (in Gremlin notation for >>>ease of readability): >>> >>> order.by(dependencyCluster).thenBy(patternType).thenBy(cost) >>> >>> At the moment of execution for a particular traverser, it would have >>>this global ordering. However, it would want to do (2) above and pick a >>>pattern that it has both a start and end for -- thus, it would simply >>>iterate down the list of traversals until it found one it had both >>>start/end-keys for -- if none, then first one it had a start key for. >>> >>> Probably should have just implemented it instead of writing an email >>>as its pretty trivial to do. >>> >>> Take care, >>> Marko. >>> >>> http://markorodriguez.com >>> >> >
