GitHub user marmbrus commented on the pull request:
https://github.com/apache/spark/pull/6921#issuecomment-114307760
@piaozhexiu, thanks for working on this! I did some more investigation into
what types of filters seem to be supported. It seems like we might be able to
support `>`, `<`, `<=`, and `>=` as well, as long as we encode the filter
correctly.
```scala
scala> sql("CREATE TABLE test (key INT) PARTITIONED BY (i INT, s STRING)")
// works
scala>
org.apache.spark.sql.hive.test.TestHive.catalog.client.getPartitionsByFilter(testTable,
"s='11'")
// works
scala>
org.apache.spark.sql.hive.test.TestHive.catalog.client.getPartitionsByFilter(testTable,
"s>'11'")
// works
scala>
org.apache.spark.sql.hive.test.TestHive.catalog.client.getPartitionsByFilter(testTable,
"i=11")
// works
scala>
org.apache.spark.sql.hive.test.TestHive.catalog.client.getPartitionsByFilter(testTable,
"i>11")
// fails: Filtering is supported only on partition keys of type string:
scala>
org.apache.spark.sql.hive.test.TestHive.catalog.client.getPartitionsByFilter(testTable,
"i>'11'")
```
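In other words, the literal has to be quoted for string partition columns but
left unquoted for integral ones. A minimal sketch of that encoding rule (the
`encodeComparison` helper is hypothetical, purely to illustrate):
```scala
// Hypothetical helper illustrating the encoding rule observed above:
// quote the literal for string partition columns ("s>'11'"), leave it
// unquoted for integral ones ("i>11"); quoting an integral column's
// literal ("i>'11'") is what triggers the metastore error.
def encodeComparison(column: String, op: String, value: Any): String = value match {
  case s: String => s"$column$op'$s'"
  case n => s"$column$op$n"
}
```
For example, `encodeComparison("i", ">", 11)` yields `i>11`, while
`encodeComparison("s", ">", "11")` yields `s>'11'`.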
I'm curious exactly which problems you ran into, and which version of Hive you
were using when you encountered them.
A few other comments on the structure of this change:
- I think we should avoid adding the filter to the logical relation. The
problem is that this logic runs well before we have done filter pushdown, so
we could miss significant optimization opportunities. Instead, I'd modify
`HiveTableScan`, which already has the complete list of predicates that can be
used to prune partitions. When we are building the scan, we can pass this
expression into the logical relation (and probably remove the `lazy val` that
eagerly enumerates all partitions).
- I'd move the logic that constructs the filter expression into the
`HiveShim` and just pass raw Catalyst `Expression` objects through the client
interface. That way, if newer Hive versions allow more filters in the future,
we can add them on a version-specific basis. The contract can be that the
`HiveClient` does "best effort" filtering, and we will always re-evaluate all
filters against the partitions the client returns (see the sketch after this
list).
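To make the "best effort" contract concrete, here is a rough sketch of the
shape such a shim method could take (the `convertFilters` name and the exact
cases are hypothetical; it only shows the partial-conversion idea):
```scala
import org.apache.spark.sql.catalyst.expressions._
import org.apache.spark.sql.types.{IntegerType, StringType}

// Hypothetical HiveShim helper: turn the Catalyst predicates we know how
// to express into a metastore filter string and silently drop the rest.
// Since the caller re-evaluates every predicate on the partitions that
// come back, dropping a filter here only costs performance, never
// correctness.
def convertFilters(filters: Seq[Expression]): String =
  filters.flatMap {
    case EqualTo(AttributeReference(name, StringType, _, _), Literal(value, StringType)) =>
      Some(s"$name = '$value'")
    case EqualTo(AttributeReference(name, IntegerType, _, _), Literal(value, IntegerType)) =>
      Some(s"$name = $value")
    // Newer Hive versions could get additional cases (>, <, <=, >=) here.
    case _ => None // unsupported predicate: leave it for Spark-side evaluation
  }.mkString(" and ")
```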
What do you think?