[GitHub] spark pull request: [SPARK-10978] [SQL] Allow data sources to elim...

chenghao-intel Sun, 01 Nov 2015 22:36:01 -0800

Github user chenghao-intel commented on the pull request:

    https://github.com/apache/spark/pull/9399#issuecomment-152930901
  
    For example, say we have the table src (key, value), partitioned by (p1),
    and a query like "SELECT value FROM src WHERE key > p1".
    
    Assume the p1 candidates are 10 and 100, and the `key` range is (0, 50).
    -- `unhandledFilter` = Array.empty
    This will probably fail for `key > 10` (p1 = 10): we may not be able to 
filter records during the scan without first reading out all of the records; 
otherwise, `buildScan` would have to add an extra filter operation on the 
RDD[Row].
    -- `unhandledFilter` = `key > p1`
    We lose the optimization for the partition p1 = 100: the concrete filter 
is `key > 100`, and since the range of key is (0, 50) we should always return 
an empty RDD[Row].
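
    The two partitions can be modeled in plain Scala, with no Spark 
dependency. This is only a sketch under assumptions: `Partition`, `scan`, and 
`maxKey` are hypothetical names rather than data source API, and the key range 
(0, 50) is modeled as the integers 1 to 49.

```scala
// Hypothetical model of the example above; none of these names are Spark API.
final case class Partition(p1: Int, keys: Seq[Int])

object UnhandledFilterSketch {
  // Assumption: the open key range (0, 50) is modeled as the integers 1 to 49.
  val maxKey = 49

  val partitions: Seq[Partition] = Seq(
    Partition(p1 = 10, keys = 1 to maxKey),
    Partition(p1 = 100, keys = 1 to maxKey)
  )

  // Scan one partition with the concrete filter `key > p1`.
  def scan(p: Partition): Seq[Int] =
    if (p.p1 >= maxKey) Seq.empty   // p1 = 100: no key can match, skip the scan
    else p.keys.filter(_ > p.p1)    // p1 = 10: rows really have to be filtered

  def main(args: Array[String]): Unit =
    partitions.foreach { p =>
      println(s"p1 = ${p.p1}: ${scan(p).size} matching rows")
    }
}
```

    The point of the sketch is that the same predicate `key > p1` needs 
different handling per partition, so a single answer for `unhandledFilter` 
does not describe both cases.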
    
    I mean that it will be confusing for new data source developers to work 
out how to define `unhandledFilter`. Because the partition key is not treated 
like a normal attribute, it at least requires more work to get the concrete 
value, and the planning stage produces multiple filters for the different 
partition keys. What is `unhandledFilter` supposed to return?
    
    On the other hand, I am not sure it is really necessary to expose 
`unhandledFilter`. It would be a new data source API that developers have to 
be aware of for optimization purposes, but we already pass the filters down 
via the API `def buildScan(requiredColumns: Array[String], filters: 
Array[Filter]): RDD[Row]` and its variants. Splitting the filter expressions 
into 2 parts that are executed in different operators (DataSourceStrategy and 
the DataSource implementation) seems to make things more complicated. We will 
still do the splitting inside the data source implementation, but it is 
probably not wise to expose that externally.
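
    As a rough illustration of keeping the split internal, here is a plain 
Scala sketch. Everything in it is a hypothetical stand-in: `SimpleFilter`, 
`GreaterThan`, `IsEven`, and this simplified `buildScan` over integer keys are 
not the real Spark types, and the assumption that only `GreaterThan` can be 
pushed down is made up for the example.

```scala
// Hypothetical stand-ins for Spark's Filter and Row; not the real API.
sealed trait SimpleFilter { def eval(key: Int): Boolean }

final case class GreaterThan(value: Int) extends SimpleFilter {
  def eval(key: Int): Boolean = key > value
}

case object IsEven extends SimpleFilter {
  def eval(key: Int): Boolean = key % 2 == 0
}

object InternalSplitSketch {
  // Assumption: this source can evaluate only GreaterThan during its scan.
  private def canPushDown(f: SimpleFilter): Boolean = f match {
    case _: GreaterThan => true
    case _              => false
  }

  // Loosely shaped like buildScan(requiredColumns, filters), but over plain keys.
  def buildScan(data: Seq[Int], filters: Seq[SimpleFilter]): Seq[Int] = {
    // The split happens here, inside the source, and is never exposed.
    val (pushed, residual) = filters.partition(canPushDown)
    val scanned = data.filter(k => pushed.forall(_.eval(k)))
    // Extra filter step for whatever the scan itself could not handle.
    scanned.filter(k => residual.forall(_.eval(k)))
  }
}
```

    For instance, `InternalSplitSketch.buildScan(1 to 10, Seq(GreaterThan(5), 
IsEven))` applies both filters inside the source, so the caller never needs to 
know which of them were handled by the scan.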




