GitHub user Parth-Brahmbhatt opened a pull request:
https://github.com/apache/spark/pull/14306
SPARK-16669: Adding partition pruning to Metastore statistics for better
join selection.
## What changes were proposed in this pull request?
Currently the metastore statistics return the size of the entire table, which
prevents the join selection strategy from using broadcast joins even when only
a single partition of a large table is selected. This PR addresses that issue
by applying partition pruning during size estimation, so that only the size of
the selected partitions is counted. Currently this works only when partition
columns are used in equality checks under the AND, OR, and IN operators. If a
partition column is used with any other operator, the estimate falls back to
the total table size. A configuration option, off by default, controls this
behavior.
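The idea described above can be illustrated with a minimal, self-contained sketch. This is not the code from the patch: the types `PartitionInfo`, `EqualTo`, `InSet`, `Unsupported` and the object `PrunedSizeEstimate` are hypothetical names invented for illustration, and treating a predicate on a non-partition column as unsupported is a simplifying assumption. The last helper mirrors the usual rule that a relation is broadcast when its estimated size does not exceed a threshold (in Spark, `spark.sql.autoBroadcastJoinThreshold`).

```scala
// Hypothetical, simplified model of pruned size estimation (not Spark's API).
sealed trait Pred
case class EqualTo(col: String, value: String) extends Pred
case class InSet(col: String, values: Set[String]) extends Pred
case class And(left: Pred, right: Pred) extends Pred
case class Or(left: Pred, right: Pred) extends Pred
case class Unsupported(description: String) extends Pred // any other operator

// One partition: its partition-column values and its size from the metastore.
case class PartitionInfo(spec: Map[String, String], sizeInBytes: Long)

object PrunedSizeEstimate {
  // Some(true/false) if the predicate can be decided from partition values
  // alone; None if it uses an unsupported operator or a column that is not
  // a partition column (assumption: such cases are not pruned at all).
  def matches(pred: Pred, spec: Map[String, String]): Option[Boolean] = pred match {
    case EqualTo(c, v) => spec.get(c).map(_ == v)
    case InSet(c, vs)  => spec.get(c).map(vs.contains)
    case And(l, r)     => for { a <- matches(l, spec); b <- matches(r, spec) } yield a && b
    case Or(l, r)      => for { a <- matches(l, spec); b <- matches(r, spec) } yield a || b
    case Unsupported(_) => None
  }

  // Sum the sizes of the partitions that survive pruning; fall back to the
  // total table size as soon as any partition cannot be decided.
  def estimate(partitions: Seq[PartitionInfo], pred: Pred): Long = {
    val total = partitions.map(_.sizeInBytes).sum
    val decided = partitions.map { p =>
      matches(pred, p.spec).map(keep => if (keep) p.sizeInBytes else 0L)
    }
    if (decided.forall(_.isDefined)) decided.flatten.sum else total
  }

  // Broadcast is considered only when the estimate fits under the threshold.
  def canBroadcast(sizeInBytes: Long, thresholdInBytes: Long): Boolean =
    sizeInBytes <= thresholdInBytes
}
```

With three partitions of 500, 100, and 400 bytes keyed by a `ds` column, an equality predicate on one `ds` value would yield that partition's size only, while a predicate containing an `Unsupported` node would yield the full 1000 bytes, which matches the fallback behavior described above.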
## How was this patch tested?
Unit tests added, and manually tested in local environment.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/Parth-Brahmbhatt/spark SPARK-16669
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/14306.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #14306
----
commit 711932705c018d00ad76757d59ff2fb81023f2c8
Author: Parth Brahmbhatt <[email protected]>
Date: 2016-07-15T23:21:21Z
[SPARK-16669][SQL] Adding partition pruning to Metastore statistics for
better join selection.
Currently the metastore statistics return the size of the entire table, which
prevents the join selection strategy from using broadcast joins even when only
a single partition of a large table is selected.
This PR addresses that issue by applying partition pruning during size
estimation, so that only the size of the selected partitions is counted.
Currently this works only when partition columns are used in equality checks
under the AND, OR, and IN operators. If a partition column is used with any
other operator, the estimate falls back to the total table size. We have also
introduced a configuration option to enable this optimization, which is off by
default.
----