[jira] [Commented] (SPARK-3655) Support sorting of values in addition to keys (i.e. secondary sort)

Imran Rashid (JIRA) Sat, 20 Dec 2014 12:03:32 -0800

    [ https://issues.apache.org/jira/browse/SPARK-3655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14254942#comment-14254942 ]


Imran Rashid commented on SPARK-3655:
-------------------------------------

Hey Koert,

Good questions about the types; I hadn't really thought about them yet.  I 
guess I'm actually proposing 3 type parameters -- the row type (X) doesn't 
change at all, but there are additional types for the partitioning (K) and the 
sorting (V).

val x: RDD[X] = ...
// presumably f1: X => K (grouping key) and f2: X => V (sort value)
val y: SortedRDD[X,K,V] = x.groupAndSort(f1, f2)

so then you'd have

mapPartitions[Y](f: Iterator[X] => Iterator[Y]): RDD[Y]

mapGroup[Y](f: (K, Iterator[X]) => Iterator[Y]): RDD[Y]

foldByKey[Y](zero:Y)(f: (Y, X) => Y): RDD[Y]

Or maybe the return type of mapGroup & foldByKey would be RDD[(K,Seq[Y])] or 
something similar ... or there could be another variant that lets you return 
another SortedRDD.  We probably need to try out some variants and see how they 
look.
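
To make the shape of these signatures concrete, here is a rough local sketch 
that models the idea on plain Scala collections instead of RDDs.  It is not 
Spark code: the class name GroupSorted, its constructor, and the Event example 
are all made up for illustration, and foldByKey here keeps the key in its 
result (a Map[K, Y]), which is just one of the return-type variants mentioned 
above.

```scala
// In-memory model of the proposed API; every name here is hypothetical.
final class GroupSorted[X, K, V](rows: Seq[X], key: X => K, sortKey: X => V)(
    implicit ord: Ordering[V]) {

  // Rows grouped by key, with each group sorted by the secondary value.
  private val groups: Map[K, Seq[X]] =
    rows.groupBy(key).map { case (k, xs) => k -> xs.sortBy(sortKey) }

  // Hand each group to f as an iterator that is already sorted.
  def mapGroup[Y](f: (K, Iterator[X]) => Iterator[Y]): Seq[Y] =
    groups.toSeq.flatMap { case (k, xs) => f(k, xs.iterator) }

  // Fold each group in sorted order, producing one result per key.
  def foldByKey[Y](zero: Y)(f: (Y, X) => Y): Map[K, Y] =
    groups.map { case (k, xs) => k -> xs.foldLeft(zero)(f) }
}

case class Event(user: String, ts: Long, action: String)

val events = Seq(
  Event("a", 3L, "logout"),
  Event("b", 1L, "login"),
  Event("a", 1L, "login"),
  Event("a", 2L, "click")
)

// Group by user, sort each user's events by timestamp.
val gs = new GroupSorted[Event, String, Long](events, _.user, _.ts)

// Collect each user's actions in timestamp order.
val actions = gs.foldByKey(Vector.empty[String])((acc, e) => acc :+ e.action)
```

Note that the row type X (Event) is unchanged throughout; K and V only show up 
in how the rows are grouped and ordered.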

Having three type parameters is a little unwieldy ... maybe we don't even 
bother keeping the types K & V if they don't actually get us anything.  E.g. I 
don't think you actually need to expose the type V at all.  You really just 
need to keep an Ordering[X] as a member variable.  Then groupAndSort takes an X 
=> V and constructs an Ordering[X] out of it.
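
As a small sketch of that last step: the standard library's Ordering.by 
already turns an X => V (plus an implicit Ordering[V]) into an Ordering[X], so 
V never has to appear in the resulting type.  The helper name orderingFrom and 
the Reading case class below are hypothetical, just for illustration.

```scala
// Hypothetical helper: derive an Ordering[X] from a sort-key function X => V.
// V is only needed at construction time and is not part of the result type.
def orderingFrom[X, V](sortKey: X => V)(implicit ord: Ordering[V]): Ordering[X] =
  Ordering.by(sortKey)

case class Reading(sensor: String, time: Long)

// The Ordering[Reading] could be stored as a member; Long is no longer visible.
val byTime: Ordering[Reading] = orderingFrom[Reading, Long](_.time)

val readings = Seq(Reading("s1", 30L), Reading("s1", 10L), Reading("s1", 20L))
val sortedReadings = readings.sorted(byTime)
```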

Yeah, I dunno about the name either ... PartitionSortedRdd?  GroupSortedRdd? ...

Glad you are interested in this and think an implementation would be easy.  I 
was actually going to suggest that maybe I'm proposing a bigger change, so it 
should come after the existing work you've done.  Especially since I'm really 
proposing adding some new APIs for even basic partitioning & grouping, even 
without involving secondary sort at all ...

> Support sorting of values in addition to keys (i.e. secondary sort)
> -------------------------------------------------------------------
>
>                 Key: SPARK-3655
>                 URL: https://issues.apache.org/jira/browse/SPARK-3655
>             Project: Spark
>          Issue Type: New Feature
>          Components: Spark Core
>    Affects Versions: 1.1.0, 1.2.0
>            Reporter: koert kuipers
>            Assignee: Koert Kuipers
>            Priority: Minor
>
> Now that spark has a sort based shuffle, can we expect a secondary sort soon? 
> There are some use cases where getting a sorted iterator of values per key is 
> helpful.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
