Github user juliuszsompolski commented on a diff in the pull request:
https://github.com/apache/spark/pull/21133#discussion_r184656896
--- Diff:
sql/core/src/test/scala/org/apache/spark/sql/ApproximatePercentileQuerySuite.scala
---
@@ -279,4 +282,11 @@ class ApproximatePercentileQuerySuite extends QueryTest with SharedSQLContext {
checkAnswer(query, expected)
}
}
+
+ test("SPARK-24013: unneeded compress can cause performance issues with sorted input") {
+ failAfter(30 seconds) {
+ checkAnswer(sql("select approx_percentile(id, array(0.1)) from range(10000000)"),
+ Row(Array(999160)))
--- End diff --
nit:
Given the approximate nature of the algorithm, couldn't the exact answer
become flaky after small changes in code or config? For example, a different
split of the range into tasks would change how the partial aggregates are
merged, producing slightly different results.
Maybe just asserting on collect().length == 1 would do?
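If the test should still say something about the value, one middle ground is to assert that the result falls within the error bound rather than on one exact number. The sketch below is only an illustration: `ApproxPercentileCheck` and `withinTolerance` are hypothetical names, not Spark test utilities, and it assumes the documented behaviour that `1.0 / accuracy` is the relative error of `approx_percentile`, with a default accuracy of 10000. Whether that bound holds strictly after partial aggregates are merged is an assumption here.

```scala
// Hypothetical helper, not part of Spark's test utilities.
object ApproxPercentileCheck {

  /**
   * Returns true if `result` is within n / accuracy of the exact percentile.
   *
   * Assumes the input is range(n), i.e. the values 0 until n, so that a
   * value is (roughly) equal to its rank and the exact percentile is
   * approximately percentage * n.
   */
  def withinTolerance(
      result: Long,
      percentage: Double,
      n: Long,
      accuracy: Int = 10000): Boolean = {
    val exact = percentage * n            // approximate exact percentile for range(n)
    val maxError = n.toDouble / accuracy  // assumed rank-error bound: n * (1 / accuracy)
    math.abs(result - exact) <= maxError
  }
}
```

Inside the test, the single collected value could be passed to such a helper in place of the hard-coded `Row(Array(999160))`, which keeps the assertion stable if the task split or merge order changes.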
---