GitHub user Parth-Brahmbhatt opened a pull request:
https://github.com/apache/spark/pull/14655
[SPARK-16669][SQL]Adding partition prunning to Metastore statistics fâ¦
## What changes were proposed in this pull request?
Adding partition prunning to Metastore statistics for better join selection.
Currently the metastore statistics returns the size of entire table which
results in Join selection stretagy to not use broadcast joins even when only a
single partition from a large table is selected.This PR addresses that issue by
only estimating the size of the partition by applying partition pruning during
size estimation. Currently it only works with partition columns used with
equality checks under AND,OR,IN Operators. If a partition column is used in any
other operator, it defaults back to total table size. We have also introduced a
configuration to enable this optimization which will be off by default. Instead
of trying to calculate the path we could make a metastore query to get all the
valid paths but for simplicity we are just building the path in code.
## How was this patch tested?
Unit tests added.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/Parth-Brahmbhatt/spark Spark-16669
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/14655.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #14655
----
commit 7380d4a3450b985386b2f59516baa50c896c2659
Author: Parth Brahmbhatt <[email protected]>
Date: 2016-07-15T23:21:21Z
[SPARK-16669][SQL]Adding partition prunning to Metastore statistics for
better join selection.
Currently the metastore statistics returns the size of entire table which
results in Join selection stretagy to not use broadcast joins even when only a
single partition from a large table is selected.This PR addresses that issue by
only estimating the size of the partition by applying partition pruning during
size estimation. Currently it only works with partition columns used with
equality checks under AND,OR,IN Operators. If a partition column is used in any
other operator, it defaults back to total table size. We have also introduced a
configuration to enable this optimization which will be off by default. Instead
of trying to calculate the path we could make a metastore query to get all the
valid paths but for simplicity we are just building the path in code.
----
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]