[ https://issues.apache.org/jira/browse/TINKERPOP-1585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15804224#comment-15804224 ]
ASF GitHub Bot commented on TINKERPOP-1585:
-------------------------------------------
Github user dkuppitz commented on a diff in the pull request:
https://github.com/apache/tinkerpop/pull/524#discussion_r94925652
--- Diff: gremlin-core/src/main/java/org/apache/tinkerpop/gremlin/process/traversal/step/filter/DedupGlobalStep.java ---
@@ -89,6 +92,17 @@ public ElementRequirement getMaxRequirement() {
     @Override
     protected Traverser.Admin<S> processNextStart() {
+        if (null != this.barrier) {
+            this.barrierIterator = this.barrier.entrySet().iterator();
+            this.barrier = null;
+        }
+        while (this.barrierIterator != null && this.barrierIterator.hasNext()) {
+            if (null == this.barrierIterator)
--- End diff --
`this.barrierIterator` can never be null within the `while()` loop.
Unless I overlooked something fundamental, `processNextStart` can be simplified
to:
```
protected Traverser.Admin<S> processNextStart() {
    if (null != this.barrier) {
        this.barrierIterator = this.barrier.entrySet().iterator();
        this.barrier = null;
        while (this.barrierIterator.hasNext()) {
            final Map.Entry<Object, Traverser.Admin<S>> entry = this.barrierIterator.next();
            if (this.duplicateSet.add(entry.getKey()))
                return PathProcessor.processTraverserPathLabels(entry.getValue(), this.keepLabels);
        }
    }
    return PathProcessor.processTraverserPathLabels(super.processNextStart(), this.keepLabels);
}
```
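To make the control flow easier to reason about outside of TinkerPop, here is a minimal standalone sketch of the barrier-draining pattern. Everything in it is an assumption for illustration: `BarrierDedupSketch`, `addBarrier`, `next` and `drain` are hypothetical names, plain `String` values stand in for traversers, and upstream values are deduplicated by the value itself. It is not the actual `DedupGlobalStep` code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for a dedup step with a barrier; not TinkerPop code.
final class BarrierDedupSketch {
    // Pending barrier entries (dedup key -> value), handed over via addBarrier().
    private Map<Object, String> barrier;
    // Iterator over the barrier entries; survives across calls to next().
    private Iterator<Map.Entry<Object, String>> barrierIterator;
    // Keys that have already been emitted.
    private final Set<Object> duplicateSet = new HashSet<>();
    // Regular upstream input, consumed once the barrier is drained.
    private final Iterator<String> upstream;

    BarrierDedupSketch(final Iterator<String> upstream) {
        this.upstream = upstream;
    }

    void addBarrier(final Map<Object, String> barrier) {
        this.barrier = barrier;
    }

    // Returns the next value whose key has not been seen, or null when exhausted.
    String next() {
        if (this.barrier != null) {
            this.barrierIterator = this.barrier.entrySet().iterator();
            this.barrier = null;
        }
        // Drain barrier entries first; this loop may be resumed on a later call.
        while (this.barrierIterator != null && this.barrierIterator.hasNext()) {
            final Map.Entry<Object, String> entry = this.barrierIterator.next();
            if (this.duplicateSet.add(entry.getKey()))
                return entry.getValue();
        }
        // Then fall back to upstream, using the value itself as the dedup key.
        while (this.upstream.hasNext()) {
            final String value = this.upstream.next();
            if (this.duplicateSet.add(value))
                return value;
        }
        return null;
    }

    // Collects everything the step emits until it is exhausted.
    static List<String> drain(final BarrierDedupSketch step) {
        final List<String> out = new ArrayList<>();
        for (String v = step.next(); v != null; v = step.next())
            out.add(v);
        return out;
    }

    public static void main(final String[] args) {
        final Map<Object, String> barrier = new LinkedHashMap<>();
        barrier.put("a", "a");
        barrier.put("b", "b");
        final BarrierDedupSketch step =
                new BarrierDedupSketch(Arrays.asList("b", "c", "a", "d").iterator());
        step.addBarrier(barrier);
        System.out.println(drain(step)); // prints [a, b, c, d]
    }
}
```

In this sketch the draining loop sits outside the `null` check on purpose: `next()` returns from inside the loop, so the remaining barrier entries have to be picked up on later calls, by which time the barrier reference has already been cleared.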
> OLAP dedup over non elements
> ----------------------------
>
> Key: TINKERPOP-1585
> URL: https://issues.apache.org/jira/browse/TINKERPOP-1585
> Project: TinkerPop
> Issue Type: Bug
> Components: hadoop, process
> Affects Versions: 3.2.3
> Reporter: Daniel Kuppitz
> Assignee: Marko A. Rodriguez
>
> OLAP {{dedup()}} is highly inefficient when it's fed with non-elements.
> In a customer project, a query similar to the following returned a result in
> slightly more than 6 seconds:
> {noformat}
> persistedRDD.
> V().hasLabel("label1","label2").
> inE("edgeLabel1","edgeLabel2").outV().
> id().count()
> {noformat}
> The same query with {{dedup()}} added:
> {noformat}
> persistedRDD.
> V().hasLabel("label1","label2").
> inE("edgeLabel1","edgeLabel2").outV().
> id().dedup().count()
> {noformat}
> ...took more than 120 seconds.