[GitHub] spark pull request #19783: [SPARK-21322][SQL] support histogram in filter ca...

ron8hu Thu, 30 Nov 2017 17:56:44 -0800

Github user ron8hu commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19783#discussion_r154252063
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala ---
    @@ -513,10 +560,9 @@ case class FilterEstimation(plan: Filter) extends Logging {
     
             op match {
               case _: GreaterThan | _: GreaterThanOrEqual =>
    -            // If new ndv is 1, then new max must be equal to new min.
    -            newMin = if (newNdv == 1) newMax else newValue
    +            newMin = newValue
               case _: LessThan | _: LessThanOrEqual =>
    -            newMax = if (newNdv == 1) newMin else newValue
    +            newMax = newValue
    --- End diff --
    
    I originally wrote it that way to handle a corner test case,
    test("cbool > false"): since newNdv = 1 there, I set newMin to newMax.
    However, that logic does not work well for the skewed-distribution test
    case, test("cintHgm < 3"). In this test, newMin = 1 and newMax = 3. I think
    the revised code makes more sense.
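
To make the difference concrete, here is a minimal, self-contained sketch of the two variants. It is not the actual FilterEstimation code: `Bounds`, `Op`, `oldBounds`, and `newBounds` are hypothetical names, the column range of 1 to 10 is made up, and it assumes the estimated newNdv comes out as 1 for the skewed case.

```scala
// Hypothetical stand-ins for illustration only; not Spark's real types.
object BoundsSketch {
  sealed trait Op
  case object GreaterThan extends Op
  case object LessThan extends Op

  // Column range carried into the comparison (assumed shape).
  final case class Bounds(min: Double, max: Double)

  // Removed logic: when newNdv == 1, collapse the range to a single point.
  def oldBounds(op: Op, b: Bounds, newValue: Double, newNdv: Long): Bounds =
    op match {
      case GreaterThan => b.copy(min = if (newNdv == 1) b.max else newValue)
      case LessThan    => b.copy(max = if (newNdv == 1) b.min else newValue)
    }

  // Revised logic: always move the bound to the literal in the predicate.
  def newBounds(op: Op, b: Bounds, newValue: Double): Bounds =
    op match {
      case GreaterThan => b.copy(min = newValue)
      case LessThan    => b.copy(max = newValue)
    }
}
```

With a hypothetical column range of Bounds(1, 10) and the predicate "< 3", `oldBounds` with newNdv = 1 yields Bounds(1, 1), while `newBounds` yields Bounds(1, 3), which matches the newMin = 1, newMax = 3 described above.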



