Github user juliuszsompolski commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21133#discussion_r184656896

    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/ApproximatePercentileQuerySuite.scala ---
    @@ -279,4 +282,11 @@ class ApproximatePercentileQuerySuite extends QueryTest with SharedSQLContext {
           checkAnswer(query, expected)
         }
       }
    +
    +  test("SPARK-24013: unneeded compress can cause performance issues with sorted input") {
    +    failAfter(30 seconds) {
    +      checkAnswer(sql("select approx_percentile(id, array(0.1)) from range(10000000)"),
    +        Row(Array(999160)))
    --- End diff --

    nit: Given the approximate nature of the algorithm, couldn't the exact answer become flaky after small changes in code or config (e.g. a different split of the range into tasks, and then a different merging of the partial aggregates producing slightly different results)? Maybe just asserting on collect().length == 1 would do?
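If a check stronger than the row count is still wanted, one option is to assert that the result lies within the documented error bound instead of matching one exact value. The sketch below is only an illustration: `ApproxPercentileCheck` and `withinTolerance` are hypothetical names, not part of Spark. It assumes the default `accuracy` of 10000 for `approx_percentile`, and that for `range(n)` the value of `id` equals its rank, so a rank error of `n / accuracy` translates directly into a value error.

```scala
object ApproxPercentileCheck {

  /**
   * Hypothetical helper, not part of Spark.
   *
   * Returns true when `actual` is within n / accuracy of the exact
   * percentile of the ids 0 until n. Assumes the relative error of
   * approx_percentile is 1.0 / accuracy (default accuracy: 10000).
   */
  def withinTolerance(
      actual: Long,
      n: Long,
      percentage: Double,
      accuracy: Int = 10000): Boolean = {
    // For range(n), the value at a given rank is the rank itself.
    val exactRank = percentage * n
    // Largest deviation allowed under the assumed error bound.
    val maxError = n.toDouble / accuracy
    math.abs(actual - exactRank) <= maxError
  }
}
```

With the value from the diff, `ApproxPercentileCheck.withinTolerance(999160L, 10000000L, 0.1)` holds, since 999160 differs from 1000000 by 840 and the allowed deviation is 1000. A result that shifts slightly because of a different task split or merge order would still pass, as long as it stays inside that bound.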