[ 
https://issues.apache.org/jira/browse/TINKERPOP3-866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14942404#comment-14942404
 ] 

Marko A. Rodriguez commented on TINKERPOP3-866:
-----------------------------------------------

Okay. So I have this working. However, the syntax is a little wacky UNLESS we 
change the {{group()}} API, and in a major way. Check it.

{code}
g.V().group().by(label).by(bothE().values("weight").fold()).by(dedup(Scope.local)) // old way
g.V().group().by(label).by(bothE().values("weight").fold()).by(unfold().dedup().fold()) // new way
{code}

This {{unfold().x.fold()}} business is ugly. It's necessary because we are not 
creating one big master collection to pass to the reducer like we did in "the 
old way." In the new way, a bunch of individual collections are fed to the 
reducer as they are produced. This is where the memory/time savings come from; 
however, it makes the syntax lame.
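
To make the difference concrete, here is a rough plain-Java sketch of the two strategies. It is only an illustration under my own assumptions, not TinkerPop code: {{GroupReduceSketch}}, {{row}}, {{sampleRows}}, {{oldWay}} and {{newWay}} are made-up names, and each "row" stands in for one vertex with its already-folded edge weights.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Plain-Java illustration only; none of these names exist in TinkerPop.
class GroupReduceSketch {

    // One "row" per vertex: its label plus its already-folded edge weights.
    static Map.Entry<String, List<Double>> row(String label, Double... weights) {
        return new AbstractMap.SimpleEntry<>(label, Arrays.asList(weights));
    }

    // Made-up sample data.
    static List<Map.Entry<String, List<Double>>> sampleRows() {
        List<Map.Entry<String, List<Double>>> rows = new ArrayList<>();
        rows.add(row("person", 0.5, 1.0));
        rows.add(row("person", 1.0, 0.4));
        rows.add(row("software", 0.4, 0.4));
        return rows;
    }

    // "Old way": build one master list per key, then dedup it at the very end.
    static Map<String, List<Double>> oldWay(List<Map.Entry<String, List<Double>>> rows) {
        Map<String, List<Double>> master = new LinkedHashMap<>();
        for (Map.Entry<String, List<Double>> r : rows) {
            master.computeIfAbsent(r.getKey(), k -> new ArrayList<>()).addAll(r.getValue());
        }
        Map<String, List<Double>> result = new LinkedHashMap<>();
        master.forEach((k, all) -> result.put(k, new ArrayList<>(new LinkedHashSet<>(all))));
        return result;
    }

    // "New way": every small collection goes to the key's reducer as soon as
    // it shows up, so the reducer has to unfold it, dedup, and fold again.
    static Map<String, List<Double>> newWay(List<Map.Entry<String, List<Double>>> rows) {
        Map<String, Set<Double>> state = new LinkedHashMap<>();
        for (Map.Entry<String, List<Double>> r : rows) {
            Set<Double> seen = state.computeIfAbsent(r.getKey(), k -> new LinkedHashSet<>());
            seen.addAll(r.getValue()); // unfold() + dedup(), one collection at a time
        }
        Map<String, List<Double>> result = new LinkedHashMap<>();
        state.forEach((k, seen) -> result.put(k, new ArrayList<>(seen))); // fold()
        return result;
    }

    public static void main(String[] args) {
        System.out.println(oldWay(sampleRows()));
        System.out.println(newWay(sampleRows()));
    }
}
```

Both methods give the same answer; the second one just never holds the full per-key list with duplicates in memory.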

What is interesting to note is that I (sorta) just take the {{valueTraversal}} 
and the {{reduceTraversal}} and concatenate them inside of {{GroupStep}}. If we 
make this explicit, then we can have a {{group()}} with only two 
{{by()}}-modulators: key and value. Watch:

{code}
g.V().group().by(label).by(bothE().values("weight").dedup().fold()) // new new way
{code}

Thus, in this new new model, all you are saying is "what is the key and what is 
the traversal that I will pass all the key'd traversers through." And if you 
do:

{code}
g.V().group().by(label)
{code}

This implicitly means:

{code}
g.V().group().by(label).by(fold())
{code}
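
For anyone who thinks in Java streams, the key/value model lines up loosely with {{Collectors.groupingBy}}: the one-argument form is documented as equivalent to passing {{toList()}} as the downstream collector, much like {{by(fold())}} being implied here. The sketch below is only an analogy and not how {{GroupStep}} works internally; {{GroupingAnalogy}}, {{Vertex}} and {{SAMPLE}} are made-up stand-ins.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collector;
import java.util.stream.Collectors;

// Analogy only; Vertex here is a made-up stand-in, not a TinkerPop class.
class GroupingAnalogy {

    static final class Vertex {
        final String label;
        final List<Double> weights; // weights of the incident edges
        Vertex(String label, Double... weights) {
            this.label = label;
            this.weights = Arrays.asList(weights);
        }
    }

    // Made-up sample data.
    static final List<Vertex> SAMPLE = Arrays.asList(
            new Vertex("person", 0.5, 1.0),
            new Vertex("person", 1.0, 0.4),
            new Vertex("software", 0.4, 0.4));

    // Roughly group().by(label): only a key, values implicitly collected.
    static Map<String, List<Vertex>> byLabel(List<Vertex> vs) {
        return vs.stream().collect(Collectors.groupingBy(v -> v.label));
    }

    // Roughly group().by(label).by(fold()): same thing, spelled out.
    static Map<String, List<Vertex>> byLabelExplicitFold(List<Vertex> vs) {
        return vs.stream().collect(Collectors.groupingBy(v -> v.label, Collectors.toList()));
    }

    // Roughly group().by(label).by(bothE().values("weight").dedup().fold()):
    // a key plus one "value" pipeline that every grouped element flows through.
    static Map<String, Set<Double>> weightsByLabel(List<Vertex> vs) {
        Collector<Vertex, Set<Double>, Set<Double>> dedupWeights = Collector.of(
                LinkedHashSet::new,
                (set, v) -> set.addAll(v.weights),
                (a, b) -> { a.addAll(b); return a; });
        return vs.stream().collect(Collectors.groupingBy(v -> v.label, dedupWeights));
    }

    public static void main(String[] args) {
        System.out.println(byLabel(SAMPLE).equals(byLabelExplicitFold(SAMPLE)));
        System.out.println(weightsByLabel(SAMPLE).get("person"));
    }
}
```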

This would be a massive backwards-compatibility issue, and we can't just mark 
the old method {{@Deprecated}} because we need the {{group()}} name.

Again, the benefit is that we can now reduce "on the fly" and not build up a 
massive Collection that is then reduced. You will note that {{BarrierStep}} has 
a new method called {{processAllStarts}}, which basically does all the 
{{while(true) next()}} work to mutate the seed without actually emitting the 
seed. This way, every time we add new traversers to the "reduce traversal", we 
process them through the first {{BarrierStep}} (e.g. max, min, sum, count, 
fold, etc.).
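
Here is a minimal sketch of that "drain into the seed, emit nothing" idea in plain Java. It is not the real {{BarrierStep}}: {{SumBarrierSketch}}, {{addStart}}, {{pending}} and {{groupSum}} are hypothetical names, and a running sum stands in for whatever reduction the barrier performs. Only the method name {{processAllStarts}} is taken from the description above; its real signature may differ.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for a reducing barrier; not TinkerPop's BarrierStep.
class SumBarrierSketch {
    private final Deque<Long> starts = new ArrayDeque<>(); // queued incoming values
    private long seed = 0L;                                // running reduction

    void addStart(long value) {
        starts.add(value);
    }

    // Fold everything currently queued into the seed, but emit nothing yet.
    void processAllStarts() {
        while (!starts.isEmpty()) {
            seed += starts.poll();
        }
    }

    int pending() {
        return starts.size();
    }

    // Emit the reduction once the caller is done adding starts.
    long next() {
        processAllStarts();
        return seed;
    }

    // One barrier per key, drained each time a value arrives, so no per-key
    // collection ever builds up.
    static Map<String, Long> groupSum(String[] keys, long[] values) {
        Map<String, SumBarrierSketch> perKey = new LinkedHashMap<>();
        for (int i = 0; i < keys.length; i++) {
            SumBarrierSketch barrier = perKey.computeIfAbsent(keys[i], k -> new SumBarrierSketch());
            barrier.addStart(values[i]);
            barrier.processAllStarts(); // reduce on the fly
        }
        Map<String, Long> result = new LinkedHashMap<>();
        perKey.forEach((k, barrier) -> result.put(k, barrier.next()));
        return result;
    }

    public static void main(String[] args) {
        System.out.println(groupSum(new String[] {"a", "b", "a"}, new long[] {1, 2, 3}));
    }
}
```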

Thoughts?

> GroupStep and Traversal-Based Reductions
> ----------------------------------------
>
>                 Key: TINKERPOP3-866
>                 URL: https://issues.apache.org/jira/browse/TINKERPOP3-866
>             Project: TinkerPop 3
>          Issue Type: Improvement
>          Components: process
>    Affects Versions: 3.0.1-incubating
>            Reporter: Marko A. Rodriguez
>            Assignee: Marko A. Rodriguez
>              Labels: breaking
>             Fix For: 3.1.0-incubating
>
>
> Right now {{GroupStep}} is defined as:
> {code}
> public final class GroupStep<S, K, V, R> extends ReducingBarrierStep<S, Map<K, R>> implements MapReducer, TraversalParent {
>     private Traversal.Admin<S, K> keyTraversal = null;
>     private Traversal.Admin<S, V> valueTraversal = null;
>     private Traversal.Admin<Collection<V>, R> reduceTraversal = null;
> ...
> {code}
> Look at {{reduceTraversal}}. It takes a {{Collection<V>}} of "values" and 
> reduces them to a "reduction" {{R}}. Why are we using {{Collection<V>}}? Why 
> is this not:
> {code}
> private Traversal.Admin<V, R> reduceTraversal = null;
> {code}
> Now, when a new {{K}} is created (and reduce is defined), we clone 
> {{reduceTraversal}}. Thus, each key has its own {{reduceTraversal}} (identical 
> clones) that operates in a stream-like fashion on {{V}} to yield {{R}}. This 
> enables us to remove the {{Collection<V>}} (memory hog) and allows us to 
> define {{GroupCountStep}} in terms of {{GroupStep}} without (?limited?) 
> computational cost. HOWEVER, this changes the API as people who did this:
> {code}
> g.V.group.by(label()).by(outE().count()).by(sum(local))
> {code}
> would now have to do this:
> {code}
> g.V.group.by(label()).by(outE().count()).by(sum())
> {code}
> It's very minor, given the speed-up we would gain and the ability for us to 
> now do "groupCount" efficiently on arbitrary values -- not just bulks (e.g. 
> sacks).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
