[ https://issues.apache.org/jira/browse/SPARK-3461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15048609#comment-15048609 ]
Pere Ferrera Bertran commented on SPARK-3461:
---------------------------------------------
What's the status of this?
We published a workaround for Java users, built on top of the current API, in
case it's useful to anyone hitting this issue:
http://www.datasalt.com/2015/12/a-scalable-groupbykey-and-secondary-sort-for-java-spark/
> Support external groupByKey using repartitionAndSortWithinPartitions
> --------------------------------------------------------------------
>
> Key: SPARK-3461
> URL: https://issues.apache.org/jira/browse/SPARK-3461
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Reporter: Patrick Wendell
> Assignee: Sandy Ryza
> Priority: Critical
>
> Given that we have SPARK-2978, it seems like we could support an external
> group by operator pretty easily. We'd just have to wrap the existing iterator
> exposed by SPARK-2978 with a lookahead iterator that detects the group
> boundaries. We'd also have to override the cache() operator to cache the
> parent RDD, so that caching this object doesn't wind through the iterator.
> I haven't totally followed all the sort-shuffle internals, but just given the
> stated semantics of SPARK-2978 it seems like this would be possible.
> It would be really nice to externalize this because many beginner users write
> jobs in terms of groupByKey.
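
For anyone following the thread, the lookahead iterator described above can be sketched in plain Python, without Spark. This is only an illustration under assumptions: the input is taken to be (key, value) pairs already sorted by key, as they would be within a partition after repartitionAndSortWithinPartitions, and `group_sorted` is a hypothetical name, not a Spark API or the implementation proposed for this issue.

```python
from typing import Iterable, Iterator, Tuple, TypeVar

K = TypeVar("K")
V = TypeVar("V")

_END = object()  # sentinel marking the end of the underlying iterator


def group_sorted(pairs: Iterable[Tuple[K, V]]) -> Iterator[Tuple[K, Iterator[V]]]:
    """Yield (key, values_iterator) for pairs that are already sorted by key.

    Hypothetical helper: only a single record (the lookahead) is buffered,
    so a group's values never have to fit in memory at once.
    """
    it = iter(pairs)
    lookahead = next(it, _END)

    def values_for(key: K) -> Iterator[V]:
        nonlocal lookahead
        # Emit values until the lookahead record belongs to another key.
        while lookahead is not _END and lookahead[0] == key:
            value = lookahead[1]
            lookahead = next(it, _END)
            yield value

    while lookahead is not _END:
        key = lookahead[0]
        group = values_for(key)
        yield key, group
        # Drain anything the caller skipped so the lookahead lands on the
        # first record of the next group.
        for _ in group:
            pass


if __name__ == "__main__":
    sorted_pairs = [("a", 1), ("a", 2), ("b", 3), ("c", 4), ("c", 5)]
    for key, values in group_sorted(sorted_pairs):
        print(key, list(values))
```

Each values iterator is single-pass and is only valid until the outer iterator advances. That restriction is the reason the description mentions special handling for cache(): the grouped values cannot simply be replayed.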