GitHub user wzhfy commented on a diff in the pull request:
https://github.com/apache/spark/pull/19783#discussion_r155691788
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala ---
@@ -529,6 +570,56 @@ case class FilterEstimation(plan: Filter) extends Logging {
     Some(percent)
   }
+  /**
+   * Returns the selectivity percentage for binary condition in the column's
+   * current valid range [min, max]
+   *
+   * @param op a binary comparison operator
+   * @param histogram a numeric equi-height histogram
+   * @param max the upper bound of the current valid range for a given column
+   * @param min the lower bound of the current valid range for a given column
+   * @param datumNumber the numeric value of a literal
+   * @return the selectivity percentage for a condition in the current range.
+   */
+
+  def computePercentByEquiHeightHgm(
+      op: BinaryComparison,
+      histogram: Histogram,
+      max: Double,
+      min: Double,
+      datumNumber: Double): Double = {
+    // find bins where column's current min and max locate. Note that a column's [min, max]
+    // range may change due to another condition applied earlier.
+    val minBinId = EstimationUtils.findFirstBinForValue(min, histogram.bins)
+    val maxBinId = EstimationUtils.findLastBinForValue(max, histogram.bins)
+    assert(minBinId <= maxBinId)
+
+    // compute how many bins the column's current valid range [min, max] occupies.
+    // Note that a column's [min, max] range may vary after we apply some filter conditions.
+    val minToMaxLength = EstimationUtils.getOccupationBins(maxBinId, minBinId, max,
--- End diff --
--- End diff --
Personally, I would prefer to have this method unit-tested, because it is the
core part of filter estimation. We can do this in a follow-up anyway.
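
To make the kind of test I have in mind concrete, here is a rough, self-contained sketch. It does not use the Spark classes from this PR: `Bin`, `EquiHeightSketch`, `occupiedBins` and `percentLessThan` are hypothetical stand-ins, and the model assumes values are spread uniformly inside each bin and only covers the `<` case. The real test would call `computePercentByEquiHeightHgm` directly, but the expected values could be derived the same way.

```scala
// Hypothetical stand-in for a histogram bin; this is NOT Spark's bin class.
final case class Bin(lo: Double, hi: Double)

object EquiHeightSketch {
  // Fraction of one bin covered by [from, to], assuming uniform values inside the bin.
  private def coverage(bin: Bin, from: Double, to: Double): Double = {
    val width = bin.hi - bin.lo
    if (width <= 0.0) {
      // Degenerate bin holding a single value: counted fully if that value is in range.
      if (from <= bin.lo && bin.lo <= to) 1.0 else 0.0
    } else {
      val overlap = math.min(to, bin.hi) - math.max(from, bin.lo)
      math.max(0.0, overlap) / width
    }
  }

  // Number of (possibly fractional) bins that the range [from, to] occupies.
  def occupiedBins(bins: Seq[Bin], from: Double, to: Double): Double =
    bins.map(coverage(_, from, to)).sum

  // Selectivity of `col < datum`, relative to the current valid range [min, max].
  // In an equi-height histogram every bin holds the same share of rows, so the
  // ratio of occupied bins is the ratio of rows.
  def percentLessThan(bins: Seq[Bin], min: Double, max: Double, datum: Double): Double = {
    require(min <= max, "min must not exceed max")
    val total = occupiedBins(bins, min, max)
    if (total == 0.0) {
      0.0 // the range occupies no measurable part of any bin
    } else {
      // Clamp the literal into [min, max] so out-of-range literals give 0.0 or 1.0.
      val upper = math.min(math.max(datum, min), max)
      occupiedBins(bins, min, upper) / total
    }
  }
}
```

With four bins `[0, 10]`, `[10, 20]`, `[20, 60]`, `[60, 100]`, a test could check the full range, a range already narrowed by an earlier condition, and literals outside the range, since those are the cases where the [min, max] handling matters.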