[ https://issues.apache.org/jira/browse/SPARK-3461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Pere Ferrera Bertran updated SPARK-3461:
----------------------------------------
    Comment: was deleted

(was: Hi [~rxin], does this mean that the current DataFrames have already no memory limitation for a key when doing a groupBy? Is the "scalable" group by + secondary sort achieved by dataFrame.orderBy(...).groupBy(...)? Trying to find some more detailed information about this.)

> Support external groupByKey using repartitionAndSortWithinPartitions
> --------------------------------------------------------------------
>
>                 Key: SPARK-3461
>                 URL: https://issues.apache.org/jira/browse/SPARK-3461
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>            Reporter: Patrick Wendell
>            Assignee: Reynold Xin
>            Priority: Critical
>             Fix For: 1.6.0
>
>
> Given that we have SPARK-2978, it seems like we could support an external
> group-by operator pretty easily. We'd just have to wrap the existing iterator
> exposed by SPARK-2978 with a lookahead iterator that detects the group
> boundaries. Also, we'd have to override the cache() operator to cache the
> parent RDD so that if this object is cached it doesn't wind through the
> iterator.
> I haven't totally followed all the sort-shuffle internals, but just given the
> stated semantics of SPARK-2978 it seems like this would be possible.
> It would be really nice to externalize this because many beginner users write
> jobs in terms of groupByKey.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
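The lookahead-iterator idea in the issue description — wrapping a key-sorted iterator (as produced per partition by repartitionAndSortWithinPartitions) so that group boundaries are detected where the key changes — can be sketched as follows. This is a hypothetical illustration in plain Python, not Spark's actual implementation: a truly external operator would stream each group's values lazily, whereas this sketch buffers one group's values in a list for clarity.

```python
def grouped(sorted_pairs):
    """Group a key-sorted iterable of (key, value) pairs by detecting
    group boundaries with a one-pair lookahead.

    Illustrative sketch only: Spark's operator would avoid buffering a
    whole group in memory; here each group's values are collected in a
    list so the boundary logic stays easy to follow.
    """
    it = iter(sorted_pairs)
    try:
        current_key, first_value = next(it)
    except StopIteration:
        return  # empty input: no groups to emit
    values = [first_value]
    for key, value in it:
        if key == current_key:
            values.append(value)        # still inside the current group
        else:
            yield current_key, values   # key changed: group boundary
            current_key, values = key, [value]
    yield current_key, values           # flush the final group
```

Because the input is assumed sorted by key, a single linear pass with one pair of lookahead state is enough to recover the groups; no hash table keyed by all distinct keys is needed, which is what makes the externalized variant attractive.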