GitHub user mallman opened a pull request:
https://github.com/apache/spark/pull/17633
[SPARK-20331][SQL] Enhanced Hive partition pruning predicate pushdown
(Link to Jira: https://issues.apache.org/jira/browse/SPARK-20331)
## What changes were proposed in this pull request?
Spark 2.1 introduced scalable support for Hive tables with huge numbers of
partitions. Key to leveraging this support is the ability to prune unnecessary
table partitions to answer queries. Spark supports a subset of the class of
partition pruning predicates that the Hive metastore supports. If a user writes
a query with a partition pruning predicate that is *not* supported by Spark,
Spark falls back to loading all partitions and pruning client-side. We want to
broaden Spark's current partition pruning predicate pushdown capabilities.
One of the key missing capabilities is support for disjunctions. For
example, for a table partitioned by date, a query with a predicate like
    date = 20161011 or date = 20161014
will result in Spark fetching all partitions. For a table partitioned by
date and hour, querying a range of hours across dates can be quite difficult to
accomplish without fetching all partition metadata.
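
To illustrate the disjunction case, here is a minimal sketch of rendering a nested predicate as a metastore filter string. It is not the code in this patch: the `Pred` types and `toFilter` are hypothetical stand-ins for Catalyst expressions, and the exact filter syntax accepted by the metastore is assumed rather than taken from the PR. Every binary operator is wrapped in parentheses so that mixed `and`/`or` predicates keep their intended grouping.

```scala
object MetastoreFilterSketch {
  // Hypothetical, simplified predicate model (not Spark's Catalyst classes).
  sealed trait Pred
  final case class Eq(column: String, value: Int) extends Pred
  final case class And(left: Pred, right: Pred) extends Pred
  final case class Or(left: Pred, right: Pred) extends Pred

  // Render a predicate as a filter string, parenthesizing each binary
  // operator so the grouping never depends on operator precedence.
  def toFilter(p: Pred): String = p match {
    case Eq(c, v)  => s"$c = $v"
    case And(l, r) => s"(${toFilter(l)} and ${toFilter(r)})"
    case Or(l, r)  => s"(${toFilter(l)} or ${toFilter(r)})"
  }

  def main(args: Array[String]): Unit = {
    val pred = Or(Eq("date", 20161011), Eq("date", 20161014))
    println(toFilter(pred)) // prints (date = 20161011 or date = 20161014)
  }
}
```

With a string like this pushed down, the metastore could return only the two matching partitions instead of the full partition list.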
Partition pruning pushdown currently supports only comparisons against
literals. We can expand that to foldable expressions by evaluating them at
planning time.
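
The idea of evaluating foldable expressions at planning time can be sketched as follows. This is an illustration under assumptions, not the patch itself: `Expr`, `fold`, and `equalsFilter` are hypothetical names, and the sketch models only integer addition. An expression is treated as foldable when it contains no column reference; the commit message additionally requires the expression to be deterministic, which this toy model satisfies trivially.

```scala
object FoldableSketch {
  // Hypothetical, simplified expression model (not Spark's Catalyst classes).
  sealed trait Expr
  final case class Attr(name: String) extends Expr
  final case class Lit(value: Int) extends Expr
  final case class Add(left: Expr, right: Expr) extends Expr

  // Evaluate an expression to a constant if it references no columns;
  // return None as soon as a column reference is found.
  def fold(e: Expr): Option[Int] = e match {
    case Lit(v)    => Some(v)
    case Add(l, r) => for (a <- fold(l); b <- fold(r)) yield a + b
    case Attr(_)   => None
  }

  // Build an equality filter only when the right-hand side folds to a
  // constant; None means the comparison cannot be pushed down.
  def equalsFilter(column: Attr, rhs: Expr): Option[String] =
    fold(rhs).map(v => s"${column.name} = $v")

  def main(args: Array[String]): Unit = {
    println(equalsFilter(Attr("date"), Add(Lit(20161010), Lit(1)))) // prints Some(date = 20161011)
    println(equalsFilter(Attr("date"), Attr("other")))              // prints None
  }
}
```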
We can also implement support for the "IN" comparison by expanding it to a
sequence of "OR"s.
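
A minimal sketch of the "IN" expansion, assuming integer partition values. The helper `expandIn` is a hypothetical name and not part of Spark; returning `None` for an empty value list, meaning nothing is pushed down, is likewise an assumption of this sketch.

```scala
object InExpansionSketch {
  // Expand `column IN (v1, v2, ...)` into a parenthesized chain of
  // equality comparisons joined by "or".
  def expandIn(column: String, values: Seq[Int]): Option[String] =
    if (values.isEmpty) None
    else Some(values.map(v => s"$column = $v").mkString("(", " or ", ")"))

  def main(args: Array[String]): Unit = {
    println(expandIn("hour", Seq(1, 2, 3))) // prints Some((hour = 1 or hour = 2 or hour = 3))
  }
}
```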
## How was this patch tested?
New tests were added to `HiveClientSuite.scala`. Existing tests in
`FiltersSuite.scala` were adjusted to add parentheses to test strings.
You can merge this pull request into a Git repository by running:
    $ git pull https://github.com/VideoAmp/spark-public spark-20331-enhanced_partition_pruning_pushdown
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/17633.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #17633
----
commit 1a7663d8c40dc86c59846b52dba9766859d8e095
Author: Michael Allman
Date: 2017-03-13T18:44:15Z
[SPARK-20331][SQL] Enhanced partition pruning pushdown
Add support for evaluating foldable, deterministic expressions when
computing Hive metastore filter string for getting filtered partition
list
----