[ https://issues.apache.org/jira/browse/SPARK-3312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14341727#comment-14341727 ]

Sean Owen commented on SPARK-3312:
----------------------------------

Interesting. Is the reduce / max / min in question here by key? We already have the {{stats()}} method for RDDs of {{Double}}, which takes care of this for a whole RDD. Rather than add an API method for the by-key case, it's possible to use {{StatCounter}} to compute all of these statistics at once over the values collected for each key. Does that do the trick, or is this something more?

> Add a groupByKey which returns a special GroupBy object like in pandas
> ----------------------------------------------------------------------
>
>                 Key: SPARK-3312
>                 URL: https://issues.apache.org/jira/browse/SPARK-3312
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>            Reporter: holdenk
>            Priority: Minor
>
> A common pattern that causes problems for new Spark users is using
> groupByKey followed by a reduce. I'd like to make a special version of
> groupByKey which returns a GroupBy object (like pandas' GroupBy object).
> The resulting class would have a number of functions (min, max, stats,
> reduce) which could all be implemented efficiently.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
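The pattern Sean describes, carrying one {{StatCounter}}-style accumulator per key instead of adding new per-statistic API methods, can be sketched outside Spark. The following is a minimal pure-Python illustration of that combine-by-key idea; the class name and the merge formulas mirror Spark's {{StatCounter}} (count, mean, variance via an online/parallel merge, plus min and max), but this is a standalone sketch for illustration, not Spark's implementation:

```python
from collections import defaultdict


class StatCounter:
    """Minimal stand-in for Spark's StatCounter: tracks count, mean,
    min, max, and M2 (sum of squared deviations) in one pass."""

    def __init__(self):
        self.n = 0
        self.mu = 0.0
        self.m2 = 0.0
        self.min = float("inf")
        self.max = float("-inf")

    def merge_value(self, x):
        """Fold a single value in (the per-record seqOp)."""
        self.n += 1
        delta = x - self.mu
        self.mu += delta / self.n
        self.m2 += delta * (x - self.mu)  # Welford's online update
        self.min = min(self.min, x)
        self.max = max(self.max, x)
        return self

    def merge(self, other):
        """Combine two partial counters (the combiner-merge combOp)."""
        if other.n == 0:
            return self
        if self.n == 0:
            self.n, self.mu, self.m2 = other.n, other.mu, other.m2
            self.min, self.max = other.min, other.max
            return self
        delta = other.mu - self.mu
        total = self.n + other.n
        self.mu += delta * other.n / total
        self.m2 += other.m2 + delta * delta * self.n * other.n / total
        self.n = total
        self.min = min(self.min, other.min)
        self.max = max(self.max, other.max)
        return self

    def variance(self):
        return self.m2 / self.n if self.n else float("nan")


def stats_by_key(pairs):
    """One pass over (key, value) pairs -> {key: StatCounter},
    analogous to aggregating a StatCounter per key in Spark."""
    acc = defaultdict(StatCounter)
    for k, v in pairs:
        acc[k].merge_value(v)
    return dict(acc)
```

In Spark itself the analogous one-liner is roughly {{rdd.aggregateByKey(new StatCounter())((s, v) => s.merge(v), (a, b) => a.merge(b))}}, which computes count, mean, variance, min, and max for every key in a single shuffle, with no per-statistic passes.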