[GitHub] spark pull request #19438: [SPARK-22208] [SQL] Improve percentile_approx by ...

srowen Sat, 07 Oct 2017 00:25:25 -0700

Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19438#discussion_r143324827
  
    --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/util/QuantileSummariesSuite.scala ---
    @@ -58,7 +58,7 @@ class QuantileSummariesSuite extends SparkFunSuite {
         if (data.nonEmpty) {
           val approx = summary.query(quant).get
           // The rank of the approximation.
    -      val rank = data.count(_ < approx) // has to be <, not <= to be exact
    +      val rank = data.count(_ <= approx)
    --- End diff --
    
    This is trying to recover the rank of the element that was picked as the 
quantile. I think that's problematic when the input repeats the value chosen as 
the quantile. Consider estimating the median of [1,2,2,2,2,2,2,2,3]. If the 
method correctly picks 2, depending on whether you define this test as < or <=, 
you conclude that it picked rank 1 or 8 of 9 as the median. Any reasonable test 
of whether that rank is near the expected 5 will fail either way in some cases.
    
    One reasonable fix is to use the average of `data.count(_ < approx)` and 
`data.count(_ <= approx)` as the implied rank that was chosen.
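    
    A minimal sketch of that averaged rank, applied to the median example 
above. The object and helper names (`ImpliedRank`, `impliedRank`) are 
hypothetical and not part of QuantileSummariesSuite; the sketch only shows the 
arithmetic, not how the test's tolerance check should be written.

```scala
object ImpliedRank {
  // Hypothetical helper: averages the strict and non-strict counts so that
  // a run of values equal to `approx` maps to the middle of that run.
  def impliedRank(data: Seq[Double], approx: Double): Double = {
    val below = data.count(_ < approx)      // elements strictly less than approx
    val atOrBelow = data.count(_ <= approx) // elements less than or equal to approx
    (below + atOrBelow) / 2.0
  }

  def main(args: Array[String]): Unit = {
    val data = Seq(1.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 3.0)
    val approx = 2.0 // the value a correct median estimate would pick

    println(data.count(_ < approx))    // prints 1
    println(data.count(_ <= approx))   // prints 8
    println(impliedRank(data, approx)) // prints 4.5
  }
}
```

    For this input the averaged rank is 4.5 of 9, which sits next to the 
expected median rank instead of at either end of the run of 2s.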
    
    Do you know which test case failed without this change?


