I think this works, but it relies on `groupBy` and `agg` respecting the sort order. The API provides no such guarantee, so this could break in future versions. I would not rely on it.
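One way to avoid depending on that behavior is to sort each list *after* aggregation, e.g. with `sort_array` on the result of `collect_list`. A minimal sketch (assuming Spark 2.x with a local session; the session setup and column names are illustrative, not from the thread):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{collect_list, sort_array}

// Sort after aggregation instead of relying on groupBy/agg preserving
// a partition-local sort: sort_array orders each collected list
// explicitly, so no upstream ordering guarantee is needed.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("sorted-agg-sketch")
  .getOrCreate()
import spark.implicits._

val letters = Seq((2, "b"), (2, "c"), (2, "a"), (1, "c"), (1, "a"), (1, "b"))
  .toDF("number", "letter")

letters
  .groupBy($"number")
  .agg(sort_array(collect_list($"letter")) as "letters")
  .show()
// Each group's letters column comes out sorted regardless of input order.
```

Note this only helps when the desired order is the elements' natural sort order; sorting by a separate column would need something like collecting a struct with the sort key as its first field before applying `sort_array`.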
On Dec 20, 2016 18:58, "Liang-Chi Hsieh" <vii...@gmail.com> wrote:

Hi,

Can you try the combination of `repartition` + `sortWithinPartitions` on the dataset? E.g.,

val df = Seq((2, "b c a"), (1, "c a b"), (3, "a c b")).toDF("number", "letters")
val df2 = df.explode('letters) {
  case Row(letters: String) => letters.split(" ").map(Tuple1(_)).toSeq
}

df2
  .select('number, '_1 as 'letter)
  .repartition('number)
  .sortWithinPartitions('number, 'letter)
  .groupBy('number)
  .agg(collect_list('letter))
  .show()

+------+--------------------+
|number|collect_list(letter)|
+------+--------------------+
|     3|           [a, b, c]|
|     1|           [a, b, c]|
|     2|           [a, b, c]|
+------+--------------------+

I think it should let you do the aggregation on sorted data per key.

-----
Liang-Chi Hsieh | @viirya
Spark Technology Center
http://www.spark.tc/

--
View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/Aggregating-over-sorted-data-tp19999p20310.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.