[jira] [Updated] (SPARK-3461) Support external groupBy and groupByKey using repartitionAndSortWithinPartitions

Patrick Wendell (JIRA) Sun, 05 Oct 2014 23:13:38 -0700

     [ https://issues.apache.org/jira/browse/SPARK-3461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Patrick Wendell updated SPARK-3461:
-----------------------------------
    Description: 
Given that we have SPARK-2978, it seems like we could support an external group 
by operator pretty easily. We'd just have to wrap the existing iterator exposed 
by SPARK-2978 with a lookahead iterator that detects the group boundaries. 
Also, we'd have to override the cache() operator to cache the parent RDD so 
that if this object is cached it doesn't wind through the iterator.
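
As a rough illustration of the lookahead idea above, here is a minimal sketch in plain Python. It assumes only that the input iterator yields (key, value) pairs already sorted by key, which is what the sorted per-partition iterator would provide. The name group_sorted and the one-element lookahead buffer are hypothetical choices made for this example; they are not taken from Spark or from SPARK-2978.

```python
from typing import Iterable, Iterator, Tuple, TypeVar

K = TypeVar("K")
V = TypeVar("V")


def group_sorted(pairs: Iterable[Tuple[K, V]]) -> Iterator[Tuple[K, Iterator[V]]]:
    """Yield (key, values) groups from pairs that are already sorted by key.

    Hypothetical helper: only one record is buffered at a time, so a large
    group is never materialized in memory.
    """
    it = iter(pairs)
    sentinel = object()
    head = next(it, sentinel)  # one-element lookahead buffer

    while head is not sentinel:
        key = head[0]

        def values(current_key=key):
            nonlocal head
            # Emit values until the lookahead shows a different key (the group boundary).
            while head is not sentinel and head[0] == current_key:
                value = head[1]
                head = next(it, sentinel)
                yield value

        group = values()
        yield key, group
        # Skip anything the caller left unread so the lookahead sits on the next key.
        for _ in group:
            pass
```

Each group's values are a lazy, single-pass iterator, so they have to be consumed before moving on to the next key. In a Spark job a function like this could plausibly be applied per partition (for example through mapPartitions) after repartitionAndSortWithinPartitions, but that wiring is an assumption and is not shown here.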

I haven't totally followed all the sort-shuffle internals, but just given the 
stated semantics of SPARK-2978 it seems like this would be possible.

It would be really nice to externalize this because many beginner users write 
jobs in terms of groupBy and groupByKey.

  was:
Given that we have SPARK-2978, it seems like we could support an external group 
by operator pretty easily. We'd just have to wrap the existing iterator exposed 
by SPARK-2978 with a lookahead iterator that detects the group boundaries. 
Also, we'd have to override the cache() operator to cache the parent RDD so 
that if this object is cached it doesn't wind through the iterator.

I haven't totally followed all the sort-shuffle internals, but just given the 
stated semantics of SPARK-2978 it seems like this would be possible.

It would be really nice to externalize this because many beginner users write 
jobs in terms of groupBy.


> Support external groupBy and groupByKey using 
> repartitionAndSortWithinPartitions
> --------------------------------------------------------------------------------
>
>                 Key: SPARK-3461
>                 URL: https://issues.apache.org/jira/browse/SPARK-3461
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>            Reporter: Patrick Wendell
>            Assignee: Davies Liu
>
> Given that we have SPARK-2978, it seems like we could support an external 
> group by operator pretty easily. We'd just have to wrap the existing iterator 
> exposed by SPARK-2978 with a lookahead iterator that detects the group 
> boundaries. Also, we'd have to override the cache() operator to cache the 
> parent RDD so that if this object is cached it doesn't wind through the 
> iterator.
> I haven't totally followed all the sort-shuffle internals, but just given the 
> stated semantics of SPARK-2978 it seems like this would be possible.
> It would be really nice to externalize this because many beginner users write 
> jobs in terms of groupBy and groupByKey.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

