[ https://issues.apache.org/jira/browse/SPARK-15798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15717366#comment-15717366 ]
koert kuipers commented on SPARK-15798: --------------------------------------- using the implementation in spark-sorted (https://github.com/tresata/spark-sorted) i did some some tests. i ran about 1.8B rows (264GB of data) through secondary sort (group, sort within groups, give every record an index of its position within the group) using these settings in yarn: SPARK_OPTS="--executor-memory 5G --executor-cores 4 --num-executors 10 --driver-memory 8G" happy to report the Dataset based implementation performed better than the RDD implementation (18 mins vs 38 mins). i would have expected them to perform similarly. not sure i understand the performance difference but i will take it. > Secondary sort in Dataset/DataFrame > ----------------------------------- > > Key: SPARK-15798 > URL: https://issues.apache.org/jira/browse/SPARK-15798 > Project: Spark > Issue Type: New Feature > Components: SQL > Reporter: koert kuipers > > Secondary sort for Spark RDDs was discussed in > https://issues.apache.org/jira/browse/SPARK-3655 > Since the RDD API allows for easy extensions outside the core library this > was implemented separately here: > https://github.com/tresata/spark-sorted > However it seems to me that with Dataset an implementation in a 3rd party > library of such a feature is not really an option. > Dataset already has methods that suggest a secondary sort is present, such as > in KeyValueGroupedDataset: > {noformat} > def flatMapGroups[U : Encoder](f: (K, Iterator[V]) => TraversableOnce[U]): > Dataset[U] > {noformat} > This operation pushes all the data to the reducer, something you only would > want to do if you need the elements in a particular order. > How about as an API sortBy methods in KeyValueGroupedDataset and > RelationalGroupedDataset? > {noformat} > dataFrame.groupBy("a").sortBy("b").fold(...) > {noformat} > (yes i know RelationalGroupedDataset doesnt have a fold yet... but it should > :)) > {noformat} > dataset.groupBy(_._1).sortBy(_._3).flatMapGroups(...) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org