Github user gatorsmile commented on a diff in the pull request:
https://github.com/apache/spark/pull/19531#discussion_r147466305
--- Diff:
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/JoinEstimation.scala
---
@@ -157,64 +154,100 @@ case class InnerOuterEstimation(join: Join) extends
Logging {
// scalastyle:off
/**
* The number of rows of A inner join B on A.k1 = B.k1 is estimated by
this basic formula:
- * T(A IJ B) = T(A) * T(B) / max(V(A.k1), V(B.k1)), where V is the
number of distinct values of
- * that column. The underlying assumption for this formula is: each
value of the smaller domain
- * is included in the larger domain.
- * Generally, inner join with multiple join keys can also be estimated
based on the above
- * formula:
+ * T(A IJ B) = T(A) * T(B) / max(V(A.k1), V(B.k1)),
+ * where V is the number of distinct values (ndv) of that column. The
underlying assumption for
+ * this formula is: each value of the smaller domain is included in the
larger domain.
+ *
+ * Generally, inner join with multiple join keys can be estimated based
on the above formula:
* T(A IJ B) = T(A) * T(B) / (max(V(A.k1), V(B.k1)) * max(V(A.k2),
V(B.k2)) * ... * max(V(A.kn), V(B.kn)))
* However, the denominator can become very large and excessively reduce
the result, so we use a
* conservative strategy to take only the largest max(V(A.ki), V(B.ki))
as the denominator.
+ *
+ * That is, join estimation is based on the most selective join keys. We
follow this strategy
+ * when different types of column statistics are available. E.g., if
card1 is the cardinality
+ * estimated by ndv of join key A.k1 and B.k1, card2 is the cardinality
estimated by histograms
+ * of join key A.k2 and B.k2, then the result cardinality would be
min(card1, card2).
*/
// scalastyle:on
- def joinSelectivity(joinKeyPairs: Seq[(AttributeReference,
AttributeReference)]): BigDecimal = {
- var ndvDenom: BigInt = -1
+ private def joinCardinality(joinKeyPairs: Seq[(AttributeReference,
AttributeReference)])
+ : BigInt = {
+ // If there's no column stats available for join keys, estimate as
cartesian product.
+ var minCard: BigInt = leftStats.rowCount.get * rightStats.rowCount.get
--- End diff --
I see.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]