Takeshi Yamamuro created SPARK-18591:
----------------------------------------
             Summary: Replace hash-based aggregates with sort-based ones if inputs already sorted
                 Key: SPARK-18591
                 URL: https://issues.apache.org/jira/browse/SPARK-18591
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 2.0.2
            Reporter: Takeshi Yamamuro

Spark currently uses sort-based aggregates only under limited conditions, namely the cases where Spark cannot use partial aggregates and hash-based ones. However, if the input ordering already satisfies the requirements of sort-based aggregates, sort-based aggregates seem to be faster than hash-based ones.

{code}
./bin/spark-shell --conf spark.sql.shuffle.partitions=1

// Input is sorted by the grouping key and cached
val df = spark.range(10000000).selectExpr("id AS key", "id % 10 AS value").sort($"key").cache

// Runs `block` once and prints the elapsed wall-clock time in seconds
def timer[R](block: => R): R = {
  val t0 = System.nanoTime()
  val result = block
  val t1 = System.nanoTime()
  println("Elapsed time: " + ((t1 - t0 + 0.0) / 1000000000.0) + "s")
  result
}

timer {
  df.groupBy("key").count().count
}

// codegen'd hash aggregate
Elapsed time: 7.116962977s

// non-codegen'd sort aggregate
Elapsed time: 3.088816662s
{code}

If codegen'd sort-based aggregates are supported in SPARK-16844, the performance gap seems to become even bigger:

{code}
// codegen'd sort aggregate
Elapsed time: 1.645234684s
{code}

Therefore, it'd be better to use sort-based aggregates in this case.
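
The intuition behind the proposal can be sketched outside Spark. The following is a minimal plain-Scala illustration, not Spark's implementation: `SortedAggregateSketch`, `hashCount` and `sortedCount` are hypothetical names introduced only for this example. It assumes a grouped count over `Long` keys. When rows arrive already sorted by key, each group is contiguous, so a single pass with one running counter is enough and no hash table has to be built or probed.

```scala
import scala.collection.mutable

object SortedAggregateSketch {
  // Hash-based count: one hash-map lookup and update per row.
  // Works for any input order, but pays for hashing and map storage.
  def hashCount(rows: Iterator[Long]): Vector[(Long, Long)] = {
    val counts = mutable.LinkedHashMap.empty[Long, Long]
    rows.foreach { k => counts.update(k, counts.getOrElse(k, 0L) + 1L) }
    counts.toVector
  }

  // Sort-based count: assumes rows are already sorted by key, so equal keys
  // are adjacent. Only the current key and its running count are kept.
  def sortedCount(rows: Iterator[Long]): Vector[(Long, Long)] = {
    val out = Vector.newBuilder[(Long, Long)]
    if (rows.hasNext) {
      var current = rows.next()
      var n = 1L
      rows.foreach { k =>
        if (k == current) {
          n += 1L
        } else {
          out += ((current, n)) // key changed: emit the finished group
          current = k
          n = 1L
        }
      }
      out += ((current, n)) // emit the last group
    }
    out.result()
  }
}
```

On sorted input both functions return the same groups; `sortedCount` is only correct when the sortedness assumption holds, which is why the proposal is limited to inputs whose ordering already satisfies the aggregate's requirements.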