GitHub user wzhfy opened a pull request:

    https://github.com/apache/spark/pull/19531

    [SPARK-22310] [SQL] Refactor join estimation to incorporate estimation 
logic for different kinds of statistics

    ## What changes were proposed in this pull request?
    
    The current join estimation logic relies only on basic column statistics 
(such as ndv). Adding estimation for other kinds of statistics 
(such as histograms) is hard to fit into the current algorithm:
    1. When there are multiple pairs of join keys, the current algorithm computes 
cardinality with a single formula. But if different join keys have different 
kinds of stats, the computation logic for each pair of join keys becomes 
different, so that single formula no longer applies.
    2. Currently it computes cardinality and updates the join keys' column stats 
separately. It is better to do these two steps together, since both the 
computation and the update logic differ across kinds of stats.
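
    The two points above can be illustrated with a small, language-neutral 
sketch in Python. This is not the code in this patch: the names `ColumnStat`, 
`estimate_by_ndv` and `estimate_inner_join` are hypothetical, the NDV formula 
(rows_left * rows_right / max(ndv_left, ndv_right)) is the textbook one, and 
taking the smallest per-pair estimate as the join cardinality is an assumption 
made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ColumnStat:
    """Hypothetical stand-in for basic column statistics."""
    ndv: int  # number of distinct values


def estimate_by_ndv(left_rows, right_rows, left, right):
    """Estimate ONE pair of join keys from NDV only.

    Returns the cardinality and the updated key stat together, so an
    estimator for another kind of statistic could return both as well.
    """
    max_ndv = max(left.ndv, right.ndv)
    if max_ndv == 0:
        return 0, ColumnStat(ndv=0)
    # Textbook formula (assumed here), rounded up with integer arithmetic.
    card = (left_rows * right_rows + max_ndv - 1) // max_ndv
    # After an equi-join both keys hold the same values, so they share a stat.
    return card, ColumnStat(ndv=min(left.ndv, right.ndv))


def estimate_inner_join(left_rows, right_rows, key_pairs):
    """Estimate each key pair independently, then combine the results.

    key_pairs: list of ((left_name, left_stat), (right_name, right_stat)).
    Assumption: the most selective pair determines the join cardinality.
    """
    best_card = left_rows * right_rows  # cartesian product as upper bound
    key_stats = {}
    for (lname, lstat), (rname, rstat) in key_pairs:
        # A real dispatcher would pick the estimator by kind of statistic.
        card, stat = estimate_by_ndv(left_rows, right_rows, lstat, rstat)
        best_card = min(best_card, card)
        key_stats[lname] = stat
        key_stats[rname] = stat
    return best_card, key_stats
```

    Because each pair returns its cardinality and its updated stat as one 
result, supporting a new kind of statistic would only mean adding another 
per-pair estimator; the combining loop would stay unchanged.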
    
    ## How was this patch tested?
    
    This is a refactoring only; it is covered by existing tests.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/wzhfy/spark join_est_refactor

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/19531.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #19531
    
----
commit b30de470a11ca3f360260a8a36bc1e5eb4f355e8
Author: Zhenhua Wang <[email protected]>
Date:   2017-10-19T02:45:53Z

    refactor

----

