Daniel Shields created SPARK-17416:
--------------------------------------
Summary: Add Dataset.groupByKey overload that takes a value
selector function
Key: SPARK-17416
URL: https://issues.apache.org/jira/browse/SPARK-17416
Project: Spark
Issue Type: New Feature
Reporter: Daniel Shields
I propose that the following overload be added to Dataset[T]:
def groupByKey[K, V](keyFunc: T => K, valueFunc: T => V)(implicit arg0:
Encoder[K], arg1: Encoder[V]): KeyValueGroupedDataset[K, V]
This would simplify a number of use cases. For example, consider the following
classic MapReduce query:
rdd.flatMap(f).reduceByKey(g) // where f returns a list of tuples
With the proposed overload, an idiomatic way to write this against the Spark 2.0 Dataset API would be:
dataset.flatMap(f).groupByKey(_._1, _._2).reduceGroups(g)
Without the groupByKey overload suggested above, the groups still hold the
full (key, value) tuples, so the reduce function must rebuild a tuple:
dataset.flatMap(f).groupByKey(_._1).reduceGroups((a, b) => (a._1, g(a._2, b._2)))
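To make the comparison concrete, here is a minimal sketch of how the proposed behaviour could be emulated in user code. It is not the proposed implementation. The names GroupByKeyValueSketch, GroupByKeyOps, groupByKeyValue and wordCounts are hypothetical, and the sketch assumes a Spark version that provides KeyValueGroupedDataset.mapValues (2.1.0 or later, as far as I know) and a local SparkSession.

```scala
import org.apache.spark.sql.{Dataset, Encoder, KeyValueGroupedDataset, SparkSession}

object GroupByKeyValueSketch {

  // Hypothetical extension method that emulates the proposed overload.
  // Assumption: KeyValueGroupedDataset.mapValues is available (Spark 2.1.0+).
  implicit class GroupByKeyOps[T](val ds: Dataset[T]) extends AnyVal {
    def groupByKeyValue[K: Encoder, V: Encoder](
        keyFunc: T => K,
        valueFunc: T => V): KeyValueGroupedDataset[K, V] =
      ds.groupByKey(keyFunc).mapValues(valueFunc)
  }

  // Word count in the flatMap / group / reduce style from the description.
  def wordCounts(spark: SparkSession, lines: Seq[String]): Map[String, Int] = {
    import spark.implicits._
    lines.toDS()
      .flatMap(line => line.split(" ").toSeq.map(word => (word, 1))) // f: emit (key, value) tuples
      .groupByKeyValue(_._1, _._2)                                   // key selector, value selector
      .reduceGroups((a: Int, b: Int) => a + b)                       // g: combine values only
      .collect()
      .toMap
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("groupByKey-value-selector-sketch")
      .getOrCreate()
    try {
      wordCounts(spark, Seq("a b a", "b c")).foreach(println)
    } finally {
      spark.stop()
    }
  }
}
```

The reduce function here only sees the values, which is the simplification the overload is meant to provide; the extension method just hides the extra mapValues step.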