[GitHub] spark pull request #19594: [SPARK-21984] [SQL] Join estimation based on equi...

wzhfy Wed, 13 Dec 2017 19:42:49 -0800

Github user wzhfy commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19594#discussion_r156847785
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/JoinEstimation.scala ---
    @@ -191,8 +191,16 @@ case class JoinEstimation(join: Join) extends Logging {
           val rInterval = ValueInterval(rightKeyStat.min, rightKeyStat.max, rightKey.dataType)
           if (ValueInterval.isIntersected(lInterval, rInterval)) {
             val (newMin, newMax) = ValueInterval.intersect(lInterval, rInterval, leftKey.dataType)
    -        val (card, joinStat) = computeByNdv(leftKey, rightKey, newMin, newMax)
    -        keyStatsAfterJoin += (leftKey -> joinStat, rightKey -> joinStat)
    +        val (card, joinStat) = (leftKeyStat.histogram, rightKeyStat.histogram) match {
    +          case (Some(l: Histogram), Some(r: Histogram)) =>
    +            computeByEquiHeightHistogram(leftKey, rightKey, l, r, newMin, newMax)
    +          case _ =>
    +            computeByNdv(leftKey, rightKey, newMin, newMax)
    +        }
    +        keyStatsAfterJoin += (
    +          leftKey -> joinStat.copy(histogram = leftKeyStat.histogram),
    +          rightKey -> joinStat.copy(histogram = rightKeyStat.histogram)
    --- End diff --
    
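    Currently we don't update the histogram after the join, because the new
    min/max already tell us which bins are still valid. Keeping the original
    histogram therefore doesn't affect correctness. Updating the histogram
    (dropping the bins outside the new range) would, however, reduce memory
    usage when histograms are propagated through the plan. We can do this for
    both filter and join estimation in follow-up PRs.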
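
To illustrate the point about min/max selecting the valid bins, here is a minimal sketch. `Bin`, `EquiHeightHistogram`, `validBins` and `trim` are hypothetical names introduced only for this example; they are simplified stand-ins, not the classes or methods used in this PR. The sketch assumes a bin is valid when its `[lo, hi]` range overlaps the `[min, max]` range computed for the join key.

```scala
// Hypothetical, simplified stand-ins for an equi-height histogram.
// These are assumptions for illustration, not Spark's actual classes.
final case class Bin(lo: Double, hi: Double, ndv: Long)
final case class EquiHeightHistogram(height: Double, bins: Vector[Bin])

object HistogramTrimSketch {
  // A bin is treated as valid when its [lo, hi] range overlaps [min, max].
  // With only min/max updated, estimation can skip the other bins on the fly.
  def validBins(h: EquiHeightHistogram, min: Double, max: Double): Vector[Bin] =
    h.bins.filter(b => b.hi >= min && b.lo <= max)

  // Hypothetical "update" step: physically drop the invalid bins so that
  // a smaller histogram is carried along to later estimation steps.
  def trim(h: EquiHeightHistogram, min: Double, max: Double): EquiHeightHistogram =
    h.copy(bins = validBins(h, min, max))
}
```

Both paths see the same set of valid bins, which is why skipping the update does not change the estimate; `trim` only changes how many bins are kept in memory. A fuller version would also clip the boundary bins to the new range, which this sketch leaves out.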


