[ https://issues.apache.org/jira/browse/TINKERPOP-1254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15367673#comment-15367673 ]
ASF GitHub Bot commented on TINKERPOP-1254: ------------------------------------------- GitHub user twilmes opened a pull request: https://github.com/apache/tinkerpop/pull/358 TINKERPOP-1254 Support dropping traverser path information when it is no longer needed This PR adds support for path retraction to increase the likelihood of bulking in OLTP and OLAP modes. Traversal analysis is performed during the application of the PrunePathStrategy to identify labels that may be dropped at various points in the traversal. MatchStep also performs runtime analysis to determine which labels it can drop in addition to the labels identified during traversal strategy application. Here is a set of profiles showing the benefit of path dropping. These were generated with `TinkerGraphComputer` first against 3.2 and then TinkerPop-1254. **Note** that the times should not be compared here. The first was run on my anemic macbook which, and the second was run on an 8 core AWS m3.2xlarge instance that I've been using for testing. I'll be following up with more numbers on the same hardware but you can see here the dramatic drop in traversers with path pruning enabled. **TP 3.2** ``` gremlin> g.V().match(__.as('a').out().as('b'), __.as('b').out().as('c'), __.as('c').out().as('d')).select('d').count().profile() ==>Traversal Metrics Step Count Traversers Time (ms) % Dur ============================================================================================================= TinkerGraphStep(vertex,[]) 808 808 21.205 0.02 MatchStep(AND,[[MatchStartStep(a), ProfileStep,... 14465066 14465066 110317.813 85.88 MatchStartStep(a) 808 808 10975.797 VertexStep(OUT,vertex) 8049 8049 6953.071 MatchEndStep(b) 8049 8049 6727.167 MatchStartStep(b) 8049 7957 5461.242 VertexStep(OUT,vertex) 327370 327370 6782.024 MatchEndStep(c) 327370 327370 6238.268 MatchStartStep(c) 327370 326983 1771.773 VertexStep(OUT,vertex) 14465066 14465066 11489.228 MatchEndStep(d) 14465066 14465066 14301.313 SelectOneStep(d) 14465066 14465066 13752.966 10.71 CountGlobalStep 1 1 4363.667 3.40 >TOTAL - - 128455.652 - ``` **TinkerPop-1254** ``` gremlin> g.V().match(__.as('a').out().as('b'), __.as('b').out().as('c'), __.as('c').out().as('d')).select('d').count().profile() ==>Traversal Metrics Step Count Traversers Time (ms) % Dur ============================================================================================================= GraphStep(vertex,[]) 808 808 32.453 19.96 MatchStep(AND,[[MatchStartStep(a), ProfileStep,... 14465066 7510 89.463 55.01 MatchStartStep(a) 808 808 22.388 VertexStep(OUT,vertex) 8049 7957 85.493 MatchEndStep(b) (profiling ignored) 0.000 MatchStartStep(b) 8049 563 7.488 VertexStep(OUT,vertex) 327370 7561 19.548 MatchEndStep(c) (profiling ignored) 0.000 MatchStartStep(c) 327370 452 4.247 VertexStep(OUT,vertex) 14465066 7510 14.812 MatchEndStep(d) (profiling ignored) 0.000 SelectOneStep(d) 14465066 452 40.711 25.03 CountGlobalStep 1 1 0.001 0.00 >TOTAL - - 162.630 - ``` One thing that I did want to call out. I was seeing some strange errors in the `MutablePath.retract` method that I traced to multiple threads attempting to do a retract at the same time. I did not think this should happen or would be possible but I may be misunderstanding how `GraphComputer` executes. To fix this, I added a `synchronized` block. I was hesitant to do this but it fixed the problem and appears to have not slowed things down appreciably. There may be a better way around this though so any input would be welcome. Tests pass and I got a passing integration test last night but made a few changes since then. I'll follow up with new integration test results when they finish running today. VOTE: +1 You can merge this pull request into a Git repository by running: $ git pull https://github.com/apache/tinkerpop TINKERPOP-1254 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/tinkerpop/pull/358.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #358 ---- commit 88c5d27bd4d0e2bb898be7f4c3a86007d6422174 Author: Ted Wilmes <twil...@gmail.com> Date: 2016-07-08T12:37:41Z TINKERPOP-1254 Support dropping traverser path information when it is no longer needed This commit adds support for path retraction to increase the likelihood of bulking in OLTP and OLAP modes. Traversal analysis is performed during the application of the PrunePathStrategy to identify labels that may be dropped at various points in the traversal. MatchStep also performs runtime analysis to determine which labels it can drop in addition to the labels identified during traversal strategy application. ---- > Support dropping traverser path information when it is no longer needed. > ------------------------------------------------------------------------ > > Key: TINKERPOP-1254 > URL: https://issues.apache.org/jira/browse/TINKERPOP-1254 > Project: TinkerPop > Issue Type: Improvement > Components: process > Affects Versions: 3.1.1-incubating > Reporter: Marko A. Rodriguez > Assignee: Ted Wilmes > > The most expensive traversals (especially in OLAP) are those that can not be > "bulked." There are various reasons why two traversers at the same object can > not be bulked, but the primary reason is {{PATH}} or {{LABELED_PATH}}. That > is, when the history of the traverser is required, the probability of two > traversers having the same history is low. > A key to making traversals more efficient is to do as a much as possible to > remove historic information from a traverser so it can get bulked. How does > one do this? > {code} > g.V.as('a').out().as('b').out().where(neq('a').and().neq('b')).both().name > {code} > The {{LABELED_PATH}} of "a" and "b" are required up to the {{where()}} and at > which point, at {{both()}}, they are no longer required. It would be smart to > support: > {code} > traverser.dropLabels(Set<String>) > traverser.dropPath() > {code} > We would then, via a {{TraversalOptimizationStrategy}} insert a step between > {{where()}} and {{both()}} called {{PathPruneStep}} which would be a > {{SideEffectStep}}. The strategy would know which labels were no longer > needed (via forward lookahead) and then do: > {code} > public class PathPruneStep { > final Set<String> dropLabels = ... > final boolean dropPath = ... > public void sideEffect(final Traverser<S> traverser) { > final Traverser<S> start = this.starts.next(); > if(this.dropPath) start.dropPath(); > else start.dropLabels(labels); > } > } > {code} > Again, the more we can prune historic path data no longer needed, the higher > the probability of bulking. Think about this in terms of {{match()}}. > {code} > g.V().match( > a.out.b, > b.out.c, > c.neq.a, > c.out.b, > ).select("a") > {code} > All we need is "a" at the end. Thus, once a pattern has been passed and no > future patterns require that label, drop it! > This idea is related to TINKERPOP-331, but I don't think we should deal with > manipulating the species. Thus, I think 331 is too "low level." -- This message was sent by Atlassian JIRA (v6.3.4#6332)