[ https://issues.apache.org/jira/browse/SPARK-3312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14341727#comment-14341727 ]

Sean Owen commented on SPARK-3312:
----------------------------------

Interesting, is the reduce / max / min in question here by key? We have the 
{{stats()}} method for RDDs of {{Double}} already to take care of this for a 
whole RDD. Rather than add an API method for the by-key case, it's possible to 
use {{StatCounter}} to compute all of these at once over a bunch of values that 
have been collected by key. Does that do the trick or is this something more?
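
A sketch of the pattern Sean describes, computing all the per-key statistics in a single pass. Since Spark itself isn't available here, {{Stats}} below is a plain-Scala stand-in for Spark's {{StatCounter}} and {{statsByKey}} is a hypothetical helper, not Spark API; in real Spark the equivalent would be something like {{rdd.aggregateByKey(new StatCounter())(_ merge _, _ merge _)}}:

```scala
// Minimal stand-in for Spark's StatCounter: tracks count, min, max, sum.
case class Stats(count: Long, min: Double, max: Double, sum: Double) {
  // Fold one value into the accumulator.
  def merge(x: Double): Stats =
    Stats(count + 1, math.min(min, x), math.max(max, x), sum + x)
  // Combine two partial accumulators (what Spark would do across partitions).
  def mergeStats(o: Stats): Stats =
    Stats(count + o.count, math.min(min, o.min), math.max(max, o.max), sum + o.sum)
  def mean: Double = sum / count
}

object Stats {
  // Identity element for merging.
  val zero: Stats = Stats(0L, Double.PositiveInfinity, Double.NegativeInfinity, 0.0)
}

object StatsByKey {
  // Hypothetical helper: one pass over (key, value) pairs, accumulating a
  // Stats per key, so min/max/mean all come out of a single aggregation.
  def statsByKey(pairs: Seq[(String, Double)]): Map[String, Stats] =
    pairs.foldLeft(Map.empty[String, Stats]) { case (acc, (k, v)) =>
      acc.updated(k, acc.getOrElse(k, Stats.zero).merge(v))
    }
}
```

The point is that one accumulator type yields all the summary statistics per key at once, instead of a separate reduce per statistic or a {{groupByKey}} that materializes every value.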

> Add a groupByKey which returns a special GroupBy object like in pandas
> ----------------------------------------------------------------------
>
>                 Key: SPARK-3312
>                 URL: https://issues.apache.org/jira/browse/SPARK-3312
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>            Reporter: holdenk
>            Priority: Minor
>
> A common pattern which causes problems for new Spark users is using 
> groupByKey followed by a reduce. I'd like to make a special version of 
> groupByKey which returns a groupBy object (like the pandas groupby object). 
> The resulting class would have a number of functions (min, max, stats, 
> reduce) which could all be implemented efficiently.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
