[GitHub] spark pull request #14655: [SPARK-16669][SQL] Adding partition pruning to Me...

Parth-Brahmbhatt Mon, 15 Aug 2016 15:16:48 -0700

GitHub user Parth-Brahmbhatt opened a pull request:

    https://github.com/apache/spark/pull/14655


    [SPARK-16669][SQL] Adding partition pruning to Metastore statistics f…

    ## What changes were proposed in this pull request?
    
    Adding partition pruning to Metastore statistics for better join selection.
    
    Currently the metastore statistics return the size of the entire table, which
    causes the join selection strategy to skip broadcast joins even when only a
    single partition of a large table is selected. This PR addresses that issue by
    applying partition pruning during size estimation, so that only the size of
    the selected partition is estimated. For now this works only when partition
    columns are used in equality checks under the AND, OR, and IN operators. If a
    partition column is used with any other operator, the estimate falls back to
    the total table size. We have also introduced a configuration to enable this
    optimization; it is off by default. Instead of computing the partition paths
    ourselves, we could query the metastore for all the valid paths, but for
    simplicity we just build the paths in code.
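
To make the estimation rule above concrete, here is a minimal Python sketch. It is not the code in this PR. The tuple-based predicate encoding, the `table` dictionary, and the names `candidate_specs`, `partition_path`, and `estimate_size` are all hypothetical. The `sizes` map stands in for looking up the on-disk size of each partition path, and the `enabled` flag stands in for the new configuration, whose name is not given here. The sketch also conservatively falls back to the total size whenever it sees any predicate it cannot handle.

```python
from itertools import product

MB = 1024 * 1024
# Mirrors Spark's default spark.sql.autoBroadcastJoinThreshold (10 MB).
BROADCAST_THRESHOLD = 10 * MB


def candidate_specs(pred, partition_cols):
    """Return the {column: value} combinations a predicate can select,
    or None if it uses anything other than =, IN, AND, OR on partition columns.
    Predicates are hypothetical tuples such as ("eq", "dt", "2016-08-01")."""
    op = pred[0]
    if op == "eq":
        _, col, value = pred
        return [{col: value}] if col in partition_cols else None
    if op == "in":
        _, col, values = pred
        return [{col: v} for v in values] if col in partition_cols else None
    if op in ("and", "or"):
        left = candidate_specs(pred[1], partition_cols)
        right = candidate_specs(pred[2], partition_cols)
        if left is None or right is None:
            return None
        if op == "or":
            return left + right
        # AND: merge every pair that agrees on the columns they share.
        return [
            {**a, **b}
            for a, b in product(left, right)
            if all(a[k] == b[k] for k in a.keys() & b.keys())
        ]
    return None  # unsupported operator


def partition_path(location, partition_cols, spec):
    """Build a Hive-style path: <location>/<col1>=<v1>/<col2>=<v2>/..."""
    return "/".join([location] + [f"{c}={spec[c]}" for c in partition_cols])


def estimate_size(table, pred, enabled=False):
    """Estimate bytes read, falling back to the total table size when the
    optimization is disabled or the predicate cannot be pruned."""
    total = table["total_size"]
    if not enabled:
        return total
    cols = table["partition_cols"]
    specs = candidate_specs(pred, cols)
    # A spec that does not fix every partition column cannot become a full path.
    if specs is None or any(set(s) != set(cols) for s in specs):
        return total
    paths = {partition_path(table["location"], cols, s) for s in specs}
    return sum(table["sizes"].get(p, 0) for p in paths)


# Hypothetical table: large overall, but each daily partition is small.
events = {
    "location": "/warehouse/events",
    "partition_cols": ["dt"],
    "total_size": 3000 * MB,
    "sizes": {
        "/warehouse/events/dt=2016-08-01": 4 * MB,
        "/warehouse/events/dt=2016-08-02": 5 * MB,
    },
}

one_day = ("eq", "dt", "2016-08-01")
size = estimate_size(events, one_day, enabled=True)
can_broadcast = size <= BROADCAST_THRESHOLD
```

With pruning enabled, the single-day filter is estimated at the size of one partition, which falls under the broadcast threshold. With pruning disabled, or with a predicate such as `("gt", "dt", "2016-08-01")`, the estimate stays at the total table size.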
    
    
    ## How was this patch tested?
    Unit tests added.
    


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/Parth-Brahmbhatt/spark Spark-16669

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/14655.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #14655
    
----
commit 7380d4a3450b985386b2f59516baa50c896c2659
Author: Parth Brahmbhatt <[email protected]>
Date:   2016-07-15T23:21:21Z

    [SPARK-16669][SQL] Adding partition pruning to Metastore statistics for
    better join selection.
    

----

