Github user ron8hu commented on a diff in the pull request:
https://github.com/apache/spark/pull/19783#discussion_r155963930
--- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/statsEstimation/FilterEstimationSuite.scala ---
@@ -359,7 +371,7 @@ class FilterEstimationSuite extends StatsEstimationTestBase {
   test("cbool > false") {
     validateEstimatedStats(
       Filter(GreaterThan(attrBool, Literal(false)), childStatsTestPlan(Seq(attrBool), 10L)),
-      Seq(attrBool -> ColumnStat(distinctCount = 1, min = Some(true), max = Some(true),
+      Seq(attrBool -> ColumnStat(distinctCount = 1, min = Some(false), max = Some(true),
--- End diff ---
Agreed with wzhfy. The current logic is: for the two conditions (column > x)
and (column >= x), we set the min value to x; we do not distinguish between the
two cases. This is because we do not know the exact next value larger than x
when x has a continuous data type such as double. We could add special handling
for discrete data types such as boolean or integer, but, as wzhfy said, it does
not deserve the added complexity.
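
To make the rule concrete, here is a minimal, self-contained sketch of the bound
update described above. This is not Spark's actual FilterEstimation code;
`SimpleColumnStat` and `updateMinForGreaterThan` are hypothetical names used
only for illustration.

```scala
// Simplified stand-in for Spark's ColumnStat, using Double for the bounds.
case class SimpleColumnStat(
    distinctCount: Long,
    min: Option[Double],
    max: Option[Double])

object GreaterThanEstimation {
  // Both (column > x) and (column >= x) map to the same update: set min to x.
  // For a continuous type like Double, the exact successor of x is unknown,
  // so a strict ">" cannot tighten the bound beyond x itself.
  // (Real estimation would also adjust distinctCount and row count;
  // that is omitted here.)
  def updateMinForGreaterThan(stat: SimpleColumnStat, x: Double): SimpleColumnStat =
    stat.copy(min = Some(x))

  def main(args: Array[String]): Unit = {
    // Encode booleans as 0.0/1.0: for (cbool > false), x = 0.0,
    // so the new min stays at false, matching the test's expectation
    // of min = Some(false).
    val boolStat = SimpleColumnStat(distinctCount = 2, min = Some(0.0), max = Some(1.0))
    println(updateMinForGreaterThan(boolStat, 0.0))
  }
}
```

Distinguishing > from >= would require knowing the next value above x, which
exists for discrete types (true after false, x + 1 for integers) but not for
doubles; hence the single shared rule.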