GitHub user marmbrus commented on the pull request:
https://github.com/apache/spark/pull/6921#issuecomment-114307760
@piaozhexiu, thanks for working on this! I did some more investigation into
what types of filters seem to be supported. It seems like we might be able to
support `>`, `<`, `<=`, and `>=` as well, as long as we encode the filter
correctly.
```scala
scala> sql("CREATE TABLE test (key INT) PARTITIONED BY (i INT, s STRING)")
// works
scala>
org.apache.spark.sql.hive.test.TestHive.catalog.client.getPartitionsByFilter(testTable,
"s='11'")
// works
scala>
org.apache.spark.sql.hive.test.TestHive.catalog.client.getPartitionsByFilter(testTable,
"s>'11'")
// works
scala>
org.apache.spark.sql.hive.test.TestHive.catalog.client.getPartitionsByFilter(testTable,
"i=11")
// works
scala>
org.apache.spark.sql.hive.test.TestHive.catalog.client.getPartitionsByFilter(testTable,
"i>11")
// fails: Filtering is supported only on partition keys of type string:
scala>
org.apache.spark.sql.hive.test.TestHive.catalog.client.getPartitionsByFilter(testTable,
"i>'11'")
```
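In other words, the literal has to be quoted for string partition columns but
left unquoted for integral ones. A minimal sketch of that encoding rule (the
`encodeComparison` helper is hypothetical, purely to illustrate):
```scala
// Hypothetical helper illustrating the encoding rule observed above:
// quote the literal for string partition columns ("s>'11'"), leave it
// unquoted for integral ones ("i>11"); quoting an integral column's
// literal ("i>'11'") is what triggers the metastore error.
def encodeComparison(column: String, op: String, value: Any): String = value match {
  case s: String => s"$column$op'$s'"
  case n => s"$column$op$n"
}
```
For example, `encodeComparison("i", ">", 11)` yields `i>11`, while
`encodeComparison("s", ">", "11")` yields `s>'11'`.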
I'm curious exactly which problems you ran into, and which version of Hive you
were using when you encountered them.
A few other comments on the structure of this change:
- I think we should avoid adding the filter to the logical relation. The
problem is that this logic runs well before we have done filter pushdown, so
we could miss significant optimization opportunities. Instead, I'd modify
`HiveTableScan`, which already has the complete list of predicates that can be
used to prune partitions. When we are building the scan, we can pass this
expression into the logical relation (and probably remove the `lazy val` that
eagerly enumerates all partitions).
- I'd move the logic that constructs the filter expression into the
`HiveShim` and just pass raw Catalyst `Expression` objects through the client
interface. That way, if newer Hive versions allow more filters in the future,
we can add them on a version-specific basis. The contract can be that the
`HiveClient` does "best effort" filtering, and we will always re-evaluate all
filters against the partitions the client returns (see the sketch after this
list).
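To make the "best effort" contract concrete, here is a rough sketch of the
shape such a shim method could take (the `convertFilters` name and the exact
cases are hypothetical; it only shows the partial-conversion idea):
```scala
import org.apache.spark.sql.catalyst.expressions._
import org.apache.spark.sql.types.{IntegerType, StringType}

// Hypothetical HiveShim helper: turn the Catalyst predicates we know how
// to express into a metastore filter string and silently drop the rest.
// Since the caller re-evaluates every predicate on the partitions that
// come back, dropping a filter here only costs performance, never
// correctness.
def convertFilters(filters: Seq[Expression]): String =
  filters.flatMap {
    case EqualTo(AttributeReference(name, StringType, _, _), Literal(value, StringType)) =>
      Some(s"$name = '$value'")
    case EqualTo(AttributeReference(name, IntegerType, _, _), Literal(value, IntegerType)) =>
      Some(s"$name = $value")
    // Newer Hive versions could get additional cases (>, <, <=, >=) here.
    case _ => None // unsupported predicate: leave it for Spark-side evaluation
  }.mkString(" and ")
```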
What do you think?