[ https://issues.apache.org/jira/browse/SPARK-3655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14237299#comment-14237299 ]
koert kuipers edited comment on SPARK-3655 at 12/8/14 3:24 AM:
---------------------------------------------------------------
I have a new pull request that implements just groupByKeyAndSortValues in Scala
and Java. I will need some help with Python.
The pull request is here:
https://github.com/apache/spark/pull/3632
I changed the methods to return RDD[(K, TraversableOnce[V])] instead of RDD[(K,
Iterable[V])], since I don't see a reasonable way to return Iterables without
keeping the data in memory.
The assumption is that once you move on to the next key within a partition,
the previous key's values (the TraversableOnce[V]) will no longer be used.
I personally find this API too generic, and too easy to abuse or make mistakes
with, so I prefer a more constrained API like foldLeft. Or perhaps
groupByKeyAndSortValues could be marked as a DeveloperApi?
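
To make the one-pass assumption concrete, here is a minimal sketch in plain Scala with no Spark dependency. It is not the code from the pull request: `groupSortedValues` and `SecondarySortSketch` are hypothetical names, and the input iterator stands in for a single partition that is already sorted by key and then by value. It shows why a foldLeft-style use is safe while holding on to a previous key's values is not.

```scala
// Illustrative sketch only: `groupSortedValues` is a hypothetical helper,
// not the implementation in the pull request.
object SecondarySortSketch {

  // Input: one partition's records, already sorted by key and then by value.
  // Output: one (key, values) pair per key. `values` is a one-pass view over
  // the shared underlying iterator, so nothing is buffered per key.
  def groupSortedValues[K, V](sorted: Iterator[(K, V)]): Iterator[(K, Iterator[V])] = {
    val records = sorted.buffered
    new Iterator[(K, Iterator[V])] {
      private var current: Iterator[V] = Iterator.empty

      def hasNext: Boolean = {
        // Moving on to the next key skips whatever the caller left unread.
        while (current.hasNext) current.next()
        records.hasNext
      }

      def next(): (K, Iterator[V]) = {
        if (!hasNext) throw new NoSuchElementException("no more keys")
        val key = records.head._1
        current = new Iterator[V] {
          def hasNext: Boolean = records.hasNext && records.head._1 == key
          def next(): V = {
            if (!hasNext) throw new NoSuchElementException("no more values for key")
            records.next()._2
          }
        }
        (key, current)
      }
    }
  }

  def main(args: Array[String]): Unit = {
    def partition = Iterator(("a", 1), ("a", 2), ("a", 3), ("b", 4), ("b", 5))

    // Safe: each key's values are fully consumed before the next key is requested.
    val sums = groupSortedValues(partition)
      .map { case (k, vs) => (k, vs.foldLeft(0)(_ + _)) }
      .toList
    println(sums) // List((a,6), (b,9))

    // Unsafe: materializing the groups first advances past every key,
    // so the value iterators held afterwards are already exhausted.
    val stale = groupSortedValues(partition).toList
    println(stale.map { case (k, vs) => (k, vs.toList) }) // List((a,List()), (b,List()))
  }
}
```

The second call is the kind of mistake the generic API permits without any error. A constrained operation that folds the values of each key inside the library would presumably rule it out by construction.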
> Support sorting of values in addition to keys (i.e. secondary sort)
> -------------------------------------------------------------------
>
> Key: SPARK-3655
> URL: https://issues.apache.org/jira/browse/SPARK-3655
> Project: Spark
> Issue Type: New Feature
> Components: Spark Core
> Affects Versions: 1.1.0, 1.2.0
> Reporter: koert kuipers
> Assignee: Koert Kuipers
> Priority: Minor
>
> Now that Spark has a sort-based shuffle, can we expect a secondary sort soon?
> There are some use cases where getting a sorted iterator of values per key is
> helpful.