[
https://issues.apache.org/jira/browse/STORM-2258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16064194#comment-16064194
]
Arun Mahadevan commented on STORM-2258:
---------------------------------------
We already have a groupByKey implementation. What we are looking for is a
coGroupByKey, which is similar to a join but, instead of returning the cross
product of the values for matching keys, groups together the values for the
same key from the joined streams. For example,
Say stream1 has values - (k1, v1), (k2, v2), (k2, v3)
and stream2 has values - (k1, x1), (k1, x2), (k3, x3)
Then the co-grouped stream would contain -
{noformat}
(k1, ([v1], [x1, x2])), (k2, ([v2, v3], [])), (k3, ([], [x3]))
{noformat}
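As a rough illustration of the semantics only (this is not Storm code), the example above can be reproduced with plain Java collections. The names CoGroupSketch, groupsFor, coGroup and example are made up for this sketch, and Map.Entry stands in for the Pair type; a real implementation would live in a processor and work on streaming input rather than in-memory lists.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class CoGroupSketch {

    // Hypothetical helper (not a Storm API): returns the pair of value lists
    // for a key, creating two empty lists the first time the key is seen.
    static <K, V, W> Map.Entry<List<V>, List<W>> groupsFor(
            Map<K, Map.Entry<List<V>, List<W>>> result, K key) {
        Map.Entry<List<V>, List<W>> groups = result.get(key);
        if (groups == null) {
            groups = new SimpleEntry<List<V>, List<W>>(new ArrayList<V>(), new ArrayList<W>());
            result.put(key, groups);
        }
        return groups;
    }

    // Co-groups two in-memory lists of key-value pairs. Values from the first
    // list go to the left group, values from the second to the right group.
    // A key present in only one input keeps an empty list on the other side.
    static <K, V, W> Map<K, Map.Entry<List<V>, List<W>>> coGroup(
            List<Map.Entry<K, V>> left, List<Map.Entry<K, W>> right) {
        Map<K, Map.Entry<List<V>, List<W>>> result = new LinkedHashMap<>();
        for (Map.Entry<K, V> e : left) {
            groupsFor(result, e.getKey()).getKey().add(e.getValue());
        }
        for (Map.Entry<K, W> e : right) {
            groupsFor(result, e.getKey()).getValue().add(e.getValue());
        }
        return result;
    }

    // Builds stream1 and stream2 from the example and co-groups them.
    static Map<String, Map.Entry<List<String>, List<String>>> example() {
        List<Map.Entry<String, String>> stream1 = new ArrayList<>();
        stream1.add(new SimpleEntry<>("k1", "v1"));
        stream1.add(new SimpleEntry<>("k2", "v2"));
        stream1.add(new SimpleEntry<>("k2", "v3"));

        List<Map.Entry<String, String>> stream2 = new ArrayList<>();
        stream2.add(new SimpleEntry<>("k1", "x1"));
        stream2.add(new SimpleEntry<>("k1", "x2"));
        stream2.add(new SimpleEntry<>("k3", "x3"));

        return coGroup(stream1, stream2);
    }

    public static void main(String[] args) {
        // Prints one line per key, e.g. (k1, ([v1], [x1, x2]))
        for (Map.Entry<String, Map.Entry<List<String>, List<String>>> e : example().entrySet()) {
            System.out.println("(" + e.getKey() + ", (" + e.getValue().getKey()
                    + ", " + e.getValue().getValue() + "))");
        }
    }
}
```

Note that, unlike a join, a key that appears in only one of the inputs (k2 and k3 here) still shows up in the result, with an empty group for the other side.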
Since you are co-grouping two streams containing key-value pairs, you would
define the operation on the PairStream class, with a method signature
something like,
{noformat}
public <V1> PairStream<K, Pair<Iterable<V>, Iterable<V1>>>
coGroupByKey(PairStream<K, V1> otherStream) {
// ...
}
{noformat}
To get some hints on how to go about implementing this, you can take a look at
the implementation of the join operation. (see JoinProcessor.java)
> Streams api - support CoGroupByKey
> ----------------------------------
>
> Key: STORM-2258
> URL: https://issues.apache.org/jira/browse/STORM-2258
> Project: Apache Storm
> Issue Type: Sub-task
> Reporter: Arun Mahadevan
>
> Group together values with the same key from both streams. Similar constructs
> are supported in Beam, Spark and Flink.
> When called on a Stream of (K, V) and (K, W) pairs, return a new Stream of
> (K, Seq[V], Seq[W]) tuples
> See also - https://cloud.google.com/dataflow/model/group-by-key