[ https://issues.apache.org/jira/browse/SPARK-3461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sean Owen updated SPARK-3461:
-----------------------------
Target Version/s: (was: 1.2.0)
> Support external groupByKey using repartitionAndSortWithinPartitions
> --------------------------------------------------------------------
>
> Key: SPARK-3461
> URL: https://issues.apache.org/jira/browse/SPARK-3461
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Reporter: Patrick Wendell
> Assignee: Sandy Ryza
> Priority: Critical
>
> Given that we have SPARK-2978, it seems like we could support an external
> group by operator pretty easily. We'd just have to wrap the existing iterator
> exposed by SPARK-2978 with a lookahead iterator that detects the group
> boundaries. We'd also have to override cache() so that it caches the parent
> RDD; that way, caching the grouped RDD doesn't wind through the lookahead
> iterator.
> I haven't totally followed all the sort-shuffle internals, but just given the
> stated semantics of SPARK-2978 it seems like this would be possible.
> It would be really nice to make groupByKey external, because many beginner
> users write their jobs in terms of it.
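
For illustration, here is a minimal sketch of the lookahead idea described in the issue, written in plain Python with no Spark dependency. It assumes the records in a partition already arrive sorted by key, which is what the issue expects from SPARK-2978. The name group_sorted is hypothetical, and this is not Spark's implementation, only one way the group-boundary detection could look.

```python
from typing import Iterable, Iterator, Tuple, TypeVar

K = TypeVar("K")
V = TypeVar("V")


def group_sorted(records: Iterable[Tuple[K, V]]) -> Iterator[Tuple[K, Iterator[V]]]:
    """Yield (key, values) groups from records that are already sorted by key.

    Hypothetical helper: only one record is buffered at a time, so a large
    group never has to be held in memory as a whole.
    """
    it = iter(records)
    sentinel = object()
    head = next(it, sentinel)  # one-record lookahead buffer

    while head is not sentinel:
        key = head[0]

        def values(key=key):
            # Stream values until the lookahead record belongs to another key.
            nonlocal head
            while head is not sentinel and head[0] == key:
                value = head[1]
                head = next(it, sentinel)
                yield value

        group = values()
        yield key, group
        # Drain whatever the caller left unread so the lookahead record
        # sits on the first record of the next group.
        for _ in group:
            pass


if __name__ == "__main__":
    sorted_records = [("a", 1), ("a", 2), ("b", 3), ("c", 4), ("c", 5)]
    for k, vs in group_sorted(sorted_records):
        print(k, list(vs))
```

Each group's values can only be read once and only before moving on to the next group, which is presumably why the issue raises the question of what cache() should do. Python's itertools.groupby behaves the same way on sorted input; the sketch just spells out the lookahead explicitly. In PySpark, a function like this could plausibly be passed to mapPartitions on the result of repartitionAndSortWithinPartitions, though that wiring is an assumption and is not shown here.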