GitHub user Parth-Brahmbhatt opened a pull request:

    https://github.com/apache/spark/pull/14306

    Spark-16669:Adding partition prunning to Metastore statistics for better 
join selection.

    ## What changes were proposed in this pull request?
    Currently the metastore statistics returns the size of entire table which 
results in Join selection strategy to not use broadcast joins even when only a 
single partition from a large table is selected. This PR addresses that issue 
by only estimating the size of the partition by applying partition pruning 
during size estimation. Currently it only works with partition columns used 
with equality checks under AND,OR,IN Operators. If a partition column is used 
in any other operator, it defaults back to total table size. A config controls 
the behavior which is off by default.
    
    ## How was this patch tested?
    Unit tests added, and manually tested in local environment.
    


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/Parth-Brahmbhatt/spark SPARK-16669

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/14306.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #14306
    
----
commit 711932705c018d00ad76757d59ff2fb81023f2c8
Author: Parth Brahmbhatt <[email protected]>
Date:   2016-07-15T23:21:21Z

    [SPARK-16669][SQL]Adding partition prunning to Metastore statistics for 
better join selection.
    
    Currently the metastore statistics returns the size of entire table which 
results in Join selection stretagy to not use broadcast joins even when only a 
single partition from a large table is selected.
    This PR addresses that issue by only estimating the size of the partition 
by applying partition pruning during size estimation. Currently it only works 
with partition columns used with equality checks
    under AND,OR,IN Operators. If a partition column is used in any other 
operator, it defaults back to total table size. We have also introduced a 
configuration to enable this optimization which will be
    off by default.

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to