[GitHub] spark pull request #19783: [SPARK-21322][SQL] support histogram in filter ca...

wzhfy Thu, 07 Dec 2017 18:21:54 -0800

Github user wzhfy commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19783#discussion_r155691788
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala ---
    @@ -529,6 +570,56 @@ case class FilterEstimation(plan: Filter) extends Logging {
         Some(percent)
       }
     
    +  /**
    +   * Returns the selectivity percentage for binary condition in the column's
    +   * current valid range [min, max]
    +   *
    +   * @param op a binary comparison operator
    +   * @param histogram a numeric equi-height histogram
    +   * @param max the upper bound of the current valid range for a given column
    +   * @param min the lower bound of the current valid range for a given column
    +   * @param datumNumber the numeric value of a literal
    +   * @return the selectivity percentage for a condition in the current range.
    +   */
    +
    +  def computePercentByEquiHeightHgm(
    +      op: BinaryComparison,
    +      histogram: Histogram,
    +      max: Double,
    +      min: Double,
    +      datumNumber: Double): Double = {
    +    // find bins where column's current min and max locate.  Note that a column's [min, max]
    +    // range may change due to another condition applied earlier.
    +    val minBinId = EstimationUtils.findFirstBinForValue(min, histogram.bins)
    +    val maxBinId = EstimationUtils.findLastBinForValue(max, histogram.bins)
    +    assert(minBinId <= maxBinId)
    +
    +    // compute how many bins the column's current valid range [min, max] occupies.
    +    // Note that a column's [min, max] range may vary after we apply some filter conditions.
    +    val minToMaxLength = EstimationUtils.getOccupationBins(maxBinId, minBinId, max,
    --- End diff --
    
    Personally, I would prefer to have this method unit-tested, because it is the core
    part of filter estimation. That said, we can do this in a follow-up.
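
As a rough illustration of what such a unit test could exercise, here is a self-contained sketch of histogram-based selectivity for a `col < datum` condition. `Bin`, `HistogramSketch`, `occupiedBins`, and `lessThanSelectivity` are hypothetical names invented for this example; they are not the PR's `Histogram` or `EstimationUtils` code. The sketch assumes an equi-height histogram (every bin holds the same number of rows), a uniform distribution of values inside each bin, and a valid range [min, max] that lies within the histogram.

```scala
// Hypothetical, simplified stand-in for one equi-height histogram bin.
// Every bin is assumed to hold the same number of rows.
final case class Bin(lo: Double, hi: Double)

object HistogramSketch {

  /**
   * Number of bins (possibly fractional) covered by [from, to],
   * assuming values are uniformly distributed inside each bin.
   */
  def occupiedBins(bins: Seq[Bin], from: Double, to: Double): Double =
    bins.map { b =>
      val overlap = math.min(to, b.hi) - math.max(from, b.lo)
      // No overlap (or a zero-width bin) contributes nothing.
      if (overlap <= 0.0) 0.0 else overlap / (b.hi - b.lo)
    }.sum

  /**
   * Estimated selectivity of `col < datum`, relative to the column's
   * current valid range [min, max].
   */
  def lessThanSelectivity(
      bins: Seq[Bin],
      min: Double,
      max: Double,
      datum: Double): Double = {
    require(min <= max, "min must not exceed max")
    if (datum <= min) 0.0
    else if (datum >= max) 1.0
    else occupiedBins(bins, min, datum) / occupiedBins(bins, min, max)
  }
}
```

With the bins [0, 10], [10, 20], [20, 60], [60, 100], a test could check that `col < 40` has selectivity 0.625 over the full range [0, 100], and 0.75 once an earlier condition has narrowed the valid range to [10, 60]. Checking both cases covers the point made in the code comments that a column's [min, max] range may change after other filter conditions are applied.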



