[jira] [Updated] (SPARK-3461) Support external groupByKey using repartitionAndSortWithinPartitions

Sean Owen (JIRA) Tue, 05 May 2015 23:54:36 -0700

     [ https://issues.apache.org/jira/browse/SPARK-3461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Sean Owen updated SPARK-3461:
-----------------------------
    Target Version/s:   (was: 1.2.0)

> Support external groupByKey using repartitionAndSortWithinPartitions
> --------------------------------------------------------------------
>
>                 Key: SPARK-3461
>                 URL: https://issues.apache.org/jira/browse/SPARK-3461
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>            Reporter: Patrick Wendell
>            Assignee: Sandy Ryza
>            Priority: Critical
>
> Given that we have SPARK-2978, it seems like we could support an external 
> groupByKey operator fairly easily. We'd just have to wrap the existing 
> iterator exposed by SPARK-2978 with a lookahead iterator that detects the 
> group boundaries. We'd also have to override the cache() operator to cache 
> the parent RDD, so that caching this object doesn't run through the 
> iterator.
> I haven't fully followed all the sort-shuffle internals, but given the 
> stated semantics of SPARK-2978, this seems possible.
> It would be really nice to externalize this, because many beginner users 
> write their jobs in terms of groupByKey.
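
For illustration only, the lookahead idea in the description can be sketched in plain Python, with no Spark dependency. This is not the implementation proposed in the issue. The function name group_sorted is hypothetical, and the sketch assumes the input iterator is already sorted (or at least clustered) by key, as a sorted-within-partition iterator would be. For brevity it buffers the values of one group in a list; a fully external version would presumably hand out a lazy iterator per group instead.

```python
from typing import Iterable, Iterator, List, Tuple, TypeVar

K = TypeVar("K")
V = TypeVar("V")


def group_sorted(records: Iterable[Tuple[K, V]]) -> Iterator[Tuple[K, List[V]]]:
    """Group consecutive (key, value) pairs that share a key.

    Hypothetical helper: assumes records with equal keys are adjacent,
    so a group boundary is simply a change of key between neighbours.
    """
    it = iter(records)
    try:
        # Look ahead one record to learn the first key.
        current_key, first_value = next(it)
    except StopIteration:
        return  # empty input, no groups
    values = [first_value]
    for key, value in it:
        if key != current_key:
            # Key changed: the previous group is complete, emit it.
            yield current_key, values
            current_key, values = key, [value]
        else:
            values.append(value)
    # Emit the final group once the input is exhausted.
    yield current_key, values


if __name__ == "__main__":
    sorted_records = [("a", 1), ("a", 2), ("b", 3), ("c", 4), ("c", 5)]
    for key, group in group_sorted(sorted_records):
        print(key, group)
```

Only one group is held in memory at a time, so nothing like a hash map over all keys is needed. That is the property that makes the sorted approach attractive for an external group-by.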



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
