[ 
https://issues.apache.org/jira/browse/SPARK-15798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15717366#comment-15717366
 ] 

koert kuipers commented on SPARK-15798:
---------------------------------------

using the implementation in spark-sorted 
(https://github.com/tresata/spark-sorted) i did some some tests. i ran about 
1.8B rows (264GB of data) through secondary sort (group, sort within groups, 
give every record an index of its position within the group) using these 
settings in yarn:
SPARK_OPTS="--executor-memory 5G --executor-cores 4 --num-executors 10 
--driver-memory 8G"
happy to report the Dataset based implementation performed better than the RDD 
implementation (18 mins vs 38 mins). i would have expected them to perform 
similarly. not sure i understand the performance difference but i will take it.

> Secondary sort in Dataset/DataFrame
> -----------------------------------
>
>                 Key: SPARK-15798
>                 URL: https://issues.apache.org/jira/browse/SPARK-15798
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>            Reporter: koert kuipers
>
> Secondary sort for Spark RDDs was discussed in 
> https://issues.apache.org/jira/browse/SPARK-3655
> Since the RDD API allows for easy extensions outside the core library this 
> was implemented separately here:
> https://github.com/tresata/spark-sorted
> However it seems to me that with Dataset an implementation in a 3rd party 
> library of such a feature is not really an option.
> Dataset already has methods that suggest a secondary sort is present, such as 
> in KeyValueGroupedDataset:
> {noformat}
> def flatMapGroups[U : Encoder](f: (K, Iterator[V]) => TraversableOnce[U]): 
> Dataset[U]
> {noformat}
> This operation pushes all the data to the reducer, something you only would 
> want to do if you need the elements in a particular order.
> How about as an API sortBy methods in KeyValueGroupedDataset and 
> RelationalGroupedDataset?
> {noformat}
> dataFrame.groupBy("a").sortBy("b").fold(...)
> {noformat}
> (yes i know RelationalGroupedDataset doesnt have a fold yet... but it should 
> :))
> {noformat}
> dataset.groupBy(_._1).sortBy(_._3).flatMapGroups(...)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to