[jira] [Comment Edited] (SPARK-3655) Support sorting of values in addition to keys (i.e. secondary sort)

koert kuipers (JIRA) Sun, 07 Dec 2014 19:25:35 -0800

    [ 
https://issues.apache.org/jira/browse/SPARK-3655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14237299#comment-14237299
 ]


koert kuipers edited comment on SPARK-3655 at 12/8/14 3:24 AM:
---------------------------------------------------------------

I have a new pull request that implements just groupByKeyAndSortValues in Scala 
and Java. I will need some help with Python.

The pull request is here:
https://github.com/apache/spark/pull/3632

I changed the methods to return RDD[(K, TraversableOnce[V])] instead of RDD[(K, 
Iterable[V])], since I don't see a reasonable way to return Iterables without 
keeping the data in memory.
The assumption is that once you move on to the next key within a partition, 
the values of the previous key (the TraversableOnce[V]) will no longer be used.
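
To illustrate that assumption, here is a minimal sketch in plain Scala, not the 
code from the pull request. The names GroupSortedSketch and groupSorted are 
hypothetical, and the input is assumed to be an iterator already sorted by key 
(and by value within each key), as it would be within a partition after a 
sort-based shuffle. Each per-key iterator reads from the shared underlying 
iterator, so nothing is buffered, and it simply stops yielding values once the 
caller moves on to the next key.

```scala
object GroupSortedSketch {
  // Hypothetical helper: groups an iterator that is already sorted by key
  // into (key, values) pairs without buffering the values of a key.
  def groupSorted[K, V](sorted: Iterator[(K, V)]): Iterator[(K, Iterator[V])] = {
    val in = sorted.buffered
    new Iterator[(K, Iterator[V])] {
      // Key of the group that was handed out most recently.
      private var current: Option[K] = None

      // Drop any values of the previous key that the caller did not consume.
      private def skipRest(): Unit =
        current.foreach(k => while (in.hasNext && in.head._1 == k) in.next())

      def hasNext: Boolean = { skipRest(); in.hasNext }

      def next(): (K, Iterator[V]) = {
        skipRest()
        val key = in.head._1
        current = Some(key)
        // Single-pass view over the shared iterator, valid only for this key.
        val values = new Iterator[V] {
          def hasNext: Boolean = in.hasNext && in.head._1 == key
          def next(): V = {
            if (!hasNext) throw new NoSuchElementException("no more values for key")
            in.next()._2
          }
        }
        (key, values)
      }
    }
  }
}
```

With this sketch, the values of a key have to be consumed before asking for the 
next key. Once the outer iterator has advanced, the earlier values iterator 
reports hasNext == false, which is why an Iterable (which can be traversed 
repeatedly) cannot be offered without keeping the values in memory.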

I personally find this API too generic, and too easy to abuse or make mistakes 
with, so I prefer a more constrained API like foldLeft. Or perhaps 
groupByKeyAndSortValues could be marked as a DeveloperApi?
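
As a rough sketch of what a more constrained, foldLeft-style API could look 
like, again in plain Scala rather than against RDDs: the names FoldLeftSketch 
and foldLeftByKey are hypothetical, and the input is assumed to be sorted by 
key and by value within each key. The per-key values are never exposed as a 
collection, so there is no single-pass iterator for the caller to misuse.

```scala
object FoldLeftSketch {
  // Hypothetical helper: folds the sorted values of each key into one result.
  // The caller only sees (key, result) pairs, never the values iterator.
  def foldLeftByKey[K, V, B](sorted: Iterator[(K, V)])(zero: B)(op: (B, V) => B): Iterator[(K, B)] = {
    val in = sorted.buffered
    new Iterator[(K, B)] {
      def hasNext: Boolean = in.hasNext

      def next(): (K, B) = {
        val key = in.head._1
        var acc = zero
        // Consume every value of the current key, in sorted order.
        while (in.hasNext && in.head._1 == key) acc = op(acc, in.next()._2)
        (key, acc)
      }
    }
  }
}
```

For example, FoldLeftSketch.foldLeftByKey(Iterator(("a", 1), ("a", 2), ("b", 3)))("")((acc, v) => acc + v) 
visits the values of each key in order and produces one string per key.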




> Support sorting of values in addition to keys (i.e. secondary sort)
> -------------------------------------------------------------------
>
>                 Key: SPARK-3655
>                 URL: https://issues.apache.org/jira/browse/SPARK-3655
>             Project: Spark
>          Issue Type: New Feature
>          Components: Spark Core
>    Affects Versions: 1.1.0, 1.2.0
>            Reporter: koert kuipers
>            Assignee: Koert Kuipers
>            Priority: Minor
>
> Now that Spark has a sort-based shuffle, can we expect a secondary sort soon? 
> There are some use cases where getting a sorted iterator of values per key is 
> helpful.



