I think this works, but it relies on `groupBy` and `agg` respecting the sort order. The API provides no such guarantee, so this could break in future versions. I would not rely on it.
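One way to avoid depending on that behavior is to sort each list *after* aggregation, e.g. with `sort_array` on the result of `collect_list`. A minimal sketch (assuming Spark 2.x with a local session; the session setup and column names are illustrative, not from the thread):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{collect_list, sort_array}

// Sort after aggregation instead of relying on groupBy/agg preserving
// a partition-local sort: sort_array orders each collected list
// explicitly, so no upstream ordering guarantee is needed.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("sorted-agg-sketch")
  .getOrCreate()
import spark.implicits._

val letters = Seq((2, "b"), (2, "c"), (2, "a"), (1, "c"), (1, "a"), (1, "b"))
  .toDF("number", "letter")

letters
  .groupBy($"number")
  .agg(sort_array(collect_list($"letter")) as "letters")
  .show()
// Each group's letters column comes out sorted regardless of input order.
```

Note this only helps when the desired order is the elements' natural sort order; sorting by a separate column would need something like collecting a struct with the sort key as its first field before applying `sort_array`.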
On Dec 20, 2016 18:58, "Liang-Chi Hsieh" <vii...@gmail.com> wrote:

Hi,

Can you try the combination of `repartition` + `sortWithinPartitions` on the dataset? E.g.,

val df = Seq((2, "b c a"), (1, "c a b"), (3, "a c b")).toDF("number", "letters")
val df2 = df.explode('letters) {
  case Row(letters: String) => letters.split(" ").map(Tuple1(_)).toSeq
}

df2
  .select('number, '_1 as 'letter)
  .repartition('number)
  .sortWithinPartitions('number, 'letter)
  .groupBy('number)
  .agg(collect_list('letter))
  .show()

+------+--------------------+
|number|collect_list(letter)|
+------+--------------------+
|     3|           [a, b, c]|
|     1|           [a, b, c]|
|     2|           [a, b, c]|
+------+--------------------+

I think it should let you do the aggregation on sorted data per key.

-----
Liang-Chi Hsieh | @viirya
Spark Technology Center
http://www.spark.tc/

--
View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/Aggregating-over-sorted-data-tp19999p20310.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.