GitHub user ameent opened a pull request:
https://github.com/apache/spark/pull/20100
[SPARK-22913][SQL] Improved Hive Partition Pruning
This change adds support for pruning Hive partitions on Timestamp and Fractional
column types. Pruning for these types is placed behind new configuration options
that default to false, since it is not clear which Hive metastore implementations
support predicates on columns of these types.
The AWS Glue Catalog
(http://docs.aws.amazon.com/glue/latest/dg/populate-data-catalog.html)
does support filters on timestamp and fractional columns, and pushing these
filters down to it yields significant performance improvements in our use cases.
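As an illustration only, here is a minimal sketch of how the new behavior might be
enabled and exercised. The two configuration keys below are hypothetical placeholders
for the options this patch introduces, and `events`/`event_ts` are assumed names:

```scala
import org.apache.spark.sql.SparkSession

object TimestampPruningSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("timestamp-partition-pruning-sketch")
      .enableHiveSupport()
      .getOrCreate()

    // Hypothetical option names -- substitute the keys actually added by this
    // patch; per the description they default to false.
    spark.conf.set("spark.sql.hive.metastorePartitionPruning.timestamp", "true")
    spark.conf.set("spark.sql.hive.metastorePartitionPruning.fractional", "true")

    // With a table partitioned by a timestamp column `event_ts`, this predicate
    // can now be pushed down to the metastore (e.g. the AWS Glue Catalog)
    // instead of listing every partition client-side.
    spark.sql(
      """SELECT count(*)
        |FROM events
        |WHERE event_ts >= timestamp '2017-12-01 00:00:00'""".stripMargin
    ).show()

    spark.stop()
  }
}
```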
As part of this change, the Hive pruning suite is renamed (addressing an existing
TODO), and two ignored tests are added that validate partition pruning end to end
through integration tests. These tests are ignored because the integration test
setup uses a Hive client that throws errors when it sees partition column filters
on non-integral, non-string columns.
Unit tests, which remain active, are added to validate the filtering.
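For the sake of illustration, here is a self-contained sketch of the kind of check
such a unit test might perform; `toMetastoreFilter` and the predicate case classes
are hypothetical stand-ins, not the actual filter-conversion code touched by this
patch:

```scala
// Sketch of asserting that predicates are rendered into a metastore filter
// string. The exact syntax accepted by a given metastore may differ.
object FilterConversionSketch {
  sealed trait Predicate
  case class GreaterThanOrEqual(column: String, value: String) extends Predicate
  case class EqualTo(column: String, value: String) extends Predicate

  // Render a conjunction of simple predicates as a metastore filter string.
  def toMetastoreFilter(preds: Seq[Predicate]): String =
    preds.map {
      case GreaterThanOrEqual(c, v) => s"""$c >= "$v""""
      case EqualTo(c, v)            => s"""$c = "$v""""
    }.mkString(" and ")

  def main(args: Array[String]): Unit = {
    val filter = toMetastoreFilter(Seq(
      GreaterThanOrEqual("event_ts", "2017-12-01 00:00:00"),
      EqualTo("region", "us-east-1")))
    assert(filter == """event_ts >= "2017-12-01 00:00:00" and region = "us-east-1"""")
    println(filter)
  }
}
```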
## What changes were proposed in this pull request?
See https://issues.apache.org/jira/browse/SPARK-22913
This change addresses the JIRA. I'm looking for feedback on the change itself
and on whether the config values I added make sense. I was not able to find an
official Hive specification of which filters a metastore needs to support and,
as such, I am hesitant to turn on this behavior by default. Piggybacking on top
of the existing "advancedPartitionPruning" option felt wrong, because that config
toggles whether "in (...)" predicates are expanded into a series of "or"s, and I
don't want people to be forced to turn off that behavior just to avoid pushing
timestamp predicates.
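For context, here is a rough, self-contained sketch of the rewrite that the
existing option controls (illustrative only, not Spark's internal implementation):
an "in (...)" predicate on a partition column is expanded into equality comparisons
joined by "or" before being sent to the metastore as a filter string.

```scala
// Simplified sketch of expanding an IN-list partition predicate into ORs.
object InListExpansionSketch {
  // Expand `col in (v1, v2, ...)` into `(col = v1 or col = v2 or ...)`.
  def expandInList(column: String, values: Seq[String]): String =
    values.map(v => s"$column = $v").mkString("(", " or ", ")")

  def main(args: Array[String]): Unit = {
    // e.g. `part in (1, 2, 3)` becomes `(part = 1 or part = 2 or part = 3)`
    println(expandInList("part", Seq("1", "2", "3")))
  }
}
```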
## How was this patch tested?
This change is tested via unit tests, modified integration tests (currently
ignored), and manual tests on EMR 5.10 running against the AWS Glue Catalog as
the Hive metastore.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/ameent/spark master
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/20100.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #20100
----
commit 6b1d5dc8874bba7c707428818123ec63fd7e84f0
Author: Ameen Tayyebi <ameen.tayyebi@...>
Date: 2017-12-28T02:56:13Z
[SPARK-22913][SQL] Improved Hive Partition Pruning
----
---