pavel0fadeev commented on code in PR #48025:
URL: https://github.com/apache/spark/pull/48025#discussion_r1749380632


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/JoinEstimation.scala:
##########
@@ -233,7 +233,8 @@ case class JoinEstimation(join: Join) extends Logging {
     val rightKeyStat = rightStats.attributeStats(rightKey)
     val maxNdv = leftKeyStat.distinctCount.get.max(rightKeyStat.distinctCount.get)

Review Comment:
   As I understand it, this line is no longer needed after your change.



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/JoinEstimation.scala:
##########
@@ -233,7 +233,8 @@ case class JoinEstimation(join: Join) extends Logging {
     val rightKeyStat = rightStats.attributeStats(rightKey)
     val maxNdv = leftKeyStat.distinctCount.get.max(rightKeyStat.distinctCount.get)
     // Compute cardinality by the basic formula.
-    val card = BigDecimal(leftStats.rowCount.get * rightStats.rowCount.get) / BigDecimal(maxNdv)
+    val card = BigDecimal(leftStats.rowCount.get * rightStats.rowCount.get) /
+      BigDecimal(leftKeyStat.distinctCount.get.min(rightKeyStat.distinctCount.get))

Review Comment:
   The original formula is admittedly very rough in practice, but it gives the exact
cardinality when the join-key distribution in each table is uniform and every value
in the table with the smaller distinctCount also appears in the other table.
   Can you share whether there is any theory behind the new formula?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

