GitHub user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19438#discussion_r143324966
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameStatSuite.scala ---
    @@ -157,21 +157,21 @@ class DataFrameStatSuite extends QueryTest with SharedSQLContext {
           val error_single = 2 * 1000 * epsilon
           val error_double = 2 * 2000 * epsilon
     
    -      assert(math.abs(single1 - q1 * n) < error_single)
    -      assert(math.abs(double2 - 2 * q2 * n) < error_double)
    -      assert(math.abs(s1 - q1 * n) < error_single)
    -      assert(math.abs(s2 - q2 * n) < error_single)
    -      assert(math.abs(d1 - 2 * q1 * n) < error_double)
    -      assert(math.abs(d2 - 2 * q2 * n) < error_double)
    +      assert(math.abs(single1 - q1 * n) <= error_single)
    --- End diff --
    
    Were these failing?
    I think the test is a little off. The input column "singles" is 0-999, not 1-1000. The median, for example, could really be any number between 499 and 500. It might conventionally be defined as 499.5, but given that this is approximate and chooses an integral rank, 499 and 500 are OK, as are 498 and 501.
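    
    For concreteness, here is a minimal sketch of the rank arithmetic at play; the values of `n`, `q1`, and `epsilon` are assumptions chosen to mirror the test, not the exact test code:
    
    ```scala
    // Hypothetical sketch: "singles" holds 0 to n-1, so the exact median
    // sits between element 499 and element 500, and an approximate
    // quantile with an integral rank can land on either side of it.
    val n = 1000
    val epsilon = 0.001                        // assumed value for illustration
    val q1 = 0.5
    val singles = Seq.tabulate(n)(_.toDouble)  // 0.0, 1.0, ..., 999.0
    val target = q1 * n                        // 500.0
    val errorSingle = 2 * n * epsilon          // mirrors error_single above: 2.0
    // Every element within the rank tolerance of the target is an
    // acceptable answer under the loosened `<=` bound:
    val acceptable = singles.filter(v => math.abs(v - target) <= errorSingle)
    println(acceptable)                        // List(498.0, 499.0, 500.0, 501.0, 502.0)
    ```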
    
    I think loosening the condition like this is OK; it's coherent. It also strikes me that changing to `Seq.tabulate(n + 1)...` above would make the expected values implied here correct, and thus also fix it (see the sketch below).
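    
    A rough sketch of that alternative, purely for illustration (assuming the test builds "singles" as described above):
    
    ```scala
    // With n + 1 elements 0..n, the expected quantile q1 * n is itself an
    // element of the column, so the original strict `<` bounds would hold.
    val n = 1000
    val q1 = 0.5
    val singles = Seq.tabulate(n + 1)(_.toDouble)  // 0.0, 1.0, ..., 1000.0
    assert(singles((q1 * n).toInt) == q1 * n)      // exact median: 500.0
    ```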

