[jira] [Commented] (SPARK-15798) Secondary sort in Dataset/DataFrame

koert kuipers (JIRA) Sun, 13 Nov 2016 10:51:06 -0800

    [ 
https://issues.apache.org/jira/browse/SPARK-15798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15661918#comment-15661918
 ]


koert kuipers commented on SPARK-15798:
---------------------------------------

turns out the operations needed for this are already mostly available in 
Dataset. the one big limitation is that it seems the secondary sort does not 
get pushed into the shuffle in spark sql (but it is done efficiently with 
spilling to disk etc.). see this conversation:
https://www.mail-archive.com/[email protected]/msg58844.html

i added support for Dataset secondary sort to spark-sorted. see here:
https://github.com/tresata/spark-sorted

i would also like to add support for DataFrame but to do so i would need 
operations to convert Row to UDF inputs and back, which in spark sql are 
available (Encoder, ScalaReflection, etc.) but they support InternalRow only 
while in a 3rd party library i need to work with normal rows since InternalRows 
are never exposed (for example in Dataset[Row].mapPartitions i have Rows but 
not InternalRows).

> Secondary sort in Dataset/DataFrame
> -----------------------------------
>
>                 Key: SPARK-15798
>                 URL: https://issues.apache.org/jira/browse/SPARK-15798
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>            Reporter: koert kuipers
>
> Secondary sort for Spark RDDs was discussed in 
> https://issues.apache.org/jira/browse/SPARK-3655
> Since the RDD API allows for easy extensions outside the core library this 
> was implemented separately here:
> https://github.com/tresata/spark-sorted
> However it seems to me that with Dataset an implementation in a 3rd party 
> library of such a feature is not really an option.
> Dataset already has methods that suggest a secondary sort is present, such as 
> in KeyValueGroupedDataset:
> {noformat}
> def flatMapGroups[U : Encoder](f: (K, Iterator[V]) => TraversableOnce[U]): 
> Dataset[U]
> {noformat}
> This operation pushes all the data to the reducer, something you only would 
> want to do if you need the elements in a particular order.
> How about as an API sortBy methods in KeyValueGroupedDataset and 
> RelationalGroupedDataset?
> {noformat}
> dataFrame.groupBy("a").sortBy("b").fold(...)
> {noformat}
> (yes i know RelationalGroupedDataset doesnt have a fold yet... but it should 
> :))
> {noformat}
> dataset.groupBy(_._1).sortBy(_._3).flatMapGroups(...)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Commented] (SPARK-15798) Secondary sort in Dataset/DataFrame

Reply via email to