Github user wzhfy commented on a diff in the pull request:
https://github.com/apache/spark/pull/19594#discussion_r156847785
--- Diff:
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/JoinEstimation.scala
---
@@ -191,8 +191,16 @@ case class JoinEstimation(join: Join) extends Logging {
val rInterval = ValueInterval(rightKeyStat.min, rightKeyStat.max, rightKey.dataType)
if (ValueInterval.isIntersected(lInterval, rInterval)) {
val (newMin, newMax) = ValueInterval.intersect(lInterval, rInterval, leftKey.dataType)
- val (card, joinStat) = computeByNdv(leftKey, rightKey, newMin, newMax)
- keyStatsAfterJoin += (leftKey -> joinStat, rightKey -> joinStat)
+ val (card, joinStat) = (leftKeyStat.histogram, rightKeyStat.histogram) match {
+ case (Some(l: Histogram), Some(r: Histogram)) =>
+ computeByEquiHeightHistogram(leftKey, rightKey, l, r, newMin, newMax)
+ case _ =>
+ computeByNdv(leftKey, rightKey, newMin, newMax)
+ }
+ keyStatsAfterJoin += (
+ leftKey -> joinStat.copy(histogram = leftKeyStat.histogram),
+ rightKey -> joinStat.copy(histogram = rightKeyStat.histogram)
--- End diff --
Currently we don't update the histogram, because the new min/max already tell
us which bins are still valid, so leaving it unchanged doesn't affect
correctness. Updating histograms would, however, reduce memory usage during
histogram propagation. We can do this for both filter and join estimation in
follow-up PRs.
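
To illustrate what "updating" could mean here, below is a minimal sketch that drops the bins lying entirely outside the new [min, max] range. `Bin`, `EquiHeightHistogram` and `HistogramTrim.trim` are hypothetical stand-ins written for this example, not Spark's actual classes or the approach a follow-up PR would necessarily take. Edge bins that only partially overlap the range are kept whole, so the result is still an approximation.

```scala
// Simplified stand-ins for illustration only (assumption: not Spark's real classes).
final case class Bin(lo: Double, hi: Double, ndv: Long)
final case class EquiHeightHistogram(height: Double, bins: Seq[Bin])

object HistogramTrim {
  // Keep only the bins that overlap the closed interval [min, max].
  // Bins that overlap partially are kept whole; their bounds are not adjusted.
  def trim(h: EquiHeightHistogram, min: Double, max: Double): EquiHeightHistogram =
    h.copy(bins = h.bins.filter(b => b.hi >= min && b.lo <= max))
}
```

For example, trimming a histogram with bins [0, 10], [10, 20], [20, 30], [30, 40] to the range [12, 25] would keep only the two middle bins, so less data is carried along when the stats are propagated.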
---