GitHub user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19594#discussion_r156861642
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/JoinEstimation.scala ---
    @@ -191,8 +191,16 @@ case class JoinEstimation(join: Join) extends Logging {
           val rInterval = ValueInterval(rightKeyStat.min, rightKeyStat.max, rightKey.dataType)
           if (ValueInterval.isIntersected(lInterval, rInterval)) {
             val (newMin, newMax) = ValueInterval.intersect(lInterval, rInterval, leftKey.dataType)
    -        val (card, joinStat) = computeByNdv(leftKey, rightKey, newMin, newMax)
    -        keyStatsAfterJoin += (leftKey -> joinStat, rightKey -> joinStat)
    +        val (card, joinStat) = (leftKeyStat.histogram, rightKeyStat.histogram) match {
    +          case (Some(l: Histogram), Some(r: Histogram)) =>
    +            computeByEquiHeightHistogram(leftKey, rightKey, l, r, newMin, newMax)
    +          case _ =>
    +            computeByNdv(leftKey, rightKey, newMin, newMax)
    +        }
    +        keyStatsAfterJoin += (
    +          leftKey -> joinStat.copy(histogram = leftKeyStat.histogram),
    +          rightKey -> joinStat.copy(histogram = rightKeyStat.histogram)
    --- End diff --
    
    Actually, keeping it unchanged is more memory efficient: we just pass around pointers, whereas updating the histogram means creating a new one.
    
    Let's keep it, and add a comment to explain why.
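
To illustrate the memory point, here is a minimal sketch. It assumes simplified stand-in case classes (`Histogram`, `ColumnStat`) and two hypothetical helpers (`keepHistogram`, `trimHistogram`); none of this is Spark's actual code. It only shows that `copy(histogram = ...)` with an existing value shares the same object, while building an updated histogram allocates a new one.

```scala
// Simplified stand-ins for illustration only; these are NOT Spark's actual classes.
final case class Histogram(height: Double, bins: Vector[Double])
final case class ColumnStat(distinctCount: BigInt, histogram: Option[Histogram])

object HistogramSharingSketch {
  // Keep the key's histogram unchanged: copy() stores the same reference,
  // so no new Histogram is allocated.
  def keepHistogram(joinStat: ColumnStat, keyStat: ColumnStat): ColumnStat =
    joinStat.copy(histogram = keyStat.histogram)

  // Hypothetical "update" that drops bins below newMin: this has to build
  // a new Histogram (and a new bins vector) for the join output.
  def trimHistogram(joinStat: ColumnStat, keyStat: ColumnStat, newMin: Double): ColumnStat =
    joinStat.copy(histogram = keyStat.histogram.map(h => h.copy(bins = h.bins.filter(_ >= newMin))))

  def main(args: Array[String]): Unit = {
    val original = Histogram(height = 100.0, bins = Vector(1.0, 5.0, 9.0))
    val keyStat = ColumnStat(distinctCount = BigInt(3), histogram = Some(original))
    val joinStat = ColumnStat(distinctCount = BigInt(2), histogram = None)

    // Reference equality (eq) shows whether the object is shared.
    println(keepHistogram(joinStat, keyStat).histogram.get eq original)      // true
    println(trimHistogram(joinStat, keyStat, 5.0).histogram.get eq original) // false
  }
}
```

The trade-off is that the shared histogram still describes the pre-join column, which is why a comment at the call site is worth having.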

