viirya commented on code in PR #48025:
URL: https://github.com/apache/spark/pull/48025#discussion_r1755295545


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/JoinEstimation.scala:
##########
@@ -231,12 +231,11 @@ case class JoinEstimation(join: Join) extends Logging {
       max: Option[Any]): (BigInt, ColumnStat) = {
     val leftKeyStat = leftStats.attributeStats(leftKey)
     val rightKeyStat = rightStats.attributeStats(rightKey)
-    val maxNdv = leftKeyStat.distinctCount.get.max(rightKeyStat.distinctCount.get)
+    val newNdv = leftKeyStat.distinctCount.get.min(rightKeyStat.distinctCount.get)
     // Compute cardinality by the basic formula.
-    val card = BigDecimal(leftStats.rowCount.get * rightStats.rowCount.get) / BigDecimal(maxNdv)
+    val card = BigDecimal(leftStats.rowCount.get * rightStats.rowCount.get) / BigDecimal(newNdv)

Review Comment:
   I don't see much explanation of why this is a better estimation, either in the PR description or in a code comment. Could you add some?
   
   Roughly, it looks like this change tries to overestimate cardinality instead of underestimating it as before, so it lowers the chance that the join table will be broadcast.
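   
   To make the effect concrete, here is a rough sketch with made-up numbers (not from the PR or the actual estimator code) showing how switching the denominator from the max NDV to the min NDV inflates the basic estimate `card = |L| * |R| / ndv`:
   
   ```scala
   // Illustrative numbers only: a large fact table joined to a small dimension table.
   val leftRows  = BigInt(1000000)  // rows on the left side
   val rightRows = BigInt(1000)     // rows on the right side
   val leftNdv   = BigInt(1000000)  // distinct join-key values on the left
   val rightNdv  = BigInt(1000)     // distinct join-key values on the right

   // Old formula: divide by the larger NDV -> smaller (possibly underestimated) result.
   val cardMaxNdv = BigDecimal(leftRows * rightRows) / BigDecimal(leftNdv.max(rightNdv))  // 1000

   // New formula: divide by the smaller NDV -> larger (more conservative) result.
   val cardMinNdv = BigDecimal(leftRows * rightRows) / BigDecimal(leftNdv.min(rightNdv))  // 1000000
   ```
   
   A 1000x larger estimate like this can easily push the estimated size past `spark.sql.autoBroadcastJoinThreshold`.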
   
   Before this change, if broadcasting the join table fails, the user can disable broadcast to run the query. If after this change Spark doesn't broadcast the join table (because the estimated stats are higher than the threshold), the user may not be able to broadcast it (as the threshold is a hard limit).
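   
   For reference, disabling broadcast looks roughly like this (illustrative; in spark-shell, `spark` is the predefined SparkSession):
   
   ```scala
   // Turn off auto broadcast joins so the query can still run when a broadcast
   // chosen from an underestimated size would be too large in practice.
   spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")
   ```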
   



