[GitHub] spark pull request #19531: [SPARK-22310] [SQL] Refactor join estimation to i...

gatorsmile Wed, 25 Oct 2017 14:35:09 -0700

Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19531#discussion_r146993530
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/JoinEstimation.scala ---
    @@ -157,64 +154,100 @@ case class InnerOuterEstimation(join: Join) extends Logging {
       // scalastyle:off
       /**
        * The number of rows of A inner join B on A.k1 = B.k1 is estimated by this basic formula:
    -   * T(A IJ B) = T(A) * T(B) / max(V(A.k1), V(B.k1)), where V is the number of distinct values of
    -   * that column. The underlying assumption for this formula is: each value of the smaller domain
    -   * is included in the larger domain.
    -   * Generally, inner join with multiple join keys can also be estimated based on the above
    -   * formula:
    +   * T(A IJ B) = T(A) * T(B) / max(V(A.k1), V(B.k1)),
    +   * where V is the number of distinct values (ndv) of that column. The underlying assumption for
    +   * this formula is: each value of the smaller domain is included in the larger domain.
    +   *
    +   * Generally, inner join with multiple join keys can be estimated based on the above formula:
        * T(A IJ B) = T(A) * T(B) / (max(V(A.k1), V(B.k1)) * max(V(A.k2), V(B.k2)) * ... * max(V(A.kn), V(B.kn)))
        * However, the denominator can become very large and excessively reduce the result, so we use a
        * conservative strategy to take only the largest max(V(A.ki), V(B.ki)) as the denominator.
    +   *
    +   * That is, join estimation is based on the most selective join keys. We follow this strategy
    +   * when different types of column statistics are available. E.g., if card1 is the cardinality
    +   * estimated by ndv of join key A.k1 and B.k1, card2 is the cardinality estimated by histograms
    +   * of join key A.k2 and B.k2, then the result cardinality would be min(card1, card2).
        */
       // scalastyle:on
    -  def joinSelectivity(joinKeyPairs: Seq[(AttributeReference, AttributeReference)]): BigDecimal = {
    -    var ndvDenom: BigInt = -1
    +  private def joinCardinality(joinKeyPairs: Seq[(AttributeReference, AttributeReference)])
    +    : BigInt = {
    +    // If there's no column stats available for join keys, estimate as cartesian product.
    +    var minCard: BigInt = leftStats.rowCount.get * rightStats.rowCount.get
    --- End diff --
    
    This is different from the previous logic, right?
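
For reference, the strategy described in the doc comment above can be sketched outside Spark roughly as follows. This is a minimal illustration, not the PR's implementation: `JoinCardinalitySketch`, `KeyNdv`, and `cardByNdv` are hypothetical names, only ndv-based estimates are modeled (no histograms), integer division truncates the result, and returning 0 when both ndv values are 0 is an assumption made here to avoid dividing by zero.

```scala
// Hypothetical, Spark-free model of the estimate described in the doc comment.
object JoinCardinalitySketch {
  // Hypothetical holder for the distinct-value counts of one join-key pair.
  final case class KeyNdv(leftNdv: BigInt, rightNdv: BigInt)

  // Single key pair: T(A IJ B) = T(A) * T(B) / max(V(A.k), V(B.k)).
  // Assumption: if both sides report zero distinct values, estimate zero rows.
  def cardByNdv(leftRows: BigInt, rightRows: BigInt, key: KeyNdv): BigInt = {
    val maxNdv = key.leftNdv.max(key.rightNdv)
    if (maxNdv == BigInt(0)) BigInt(0) else (leftRows * rightRows) / maxNdv
  }

  // Start from the cartesian product (the estimate when no key has stats)
  // and keep the smallest per-key estimate, i.e. the most selective key.
  def joinCardinality(leftRows: BigInt, rightRows: BigInt, keys: Seq[KeyNdv]): BigInt =
    keys.foldLeft(leftRows * rightRows) { (minCard, key) =>
      minCard.min(cardByNdv(leftRows, rightRows, key))
    }
}
```

With T(A) = 1000, T(B) = 200, ndv pairs (100, 50) for k1 and (10, 20) for k2, the per-key estimates are 200000 / 100 = 2000 and 200000 / 20 = 10000, so the sketch returns 2000. With no key pairs it returns the cartesian product, 200000, which mirrors the initial value of `minCard` in the diff.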


