Github user chenghao-intel commented on the pull request:
https://github.com/apache/spark/pull/9399#issuecomment-152930901
For example, say we have a table src (key, value) partitioned by (p1), and a
query like "SELECT value FROM src WHERE key > p1".
Assume the p1 candidates are 10 and 100, and the `key` range is (0, 50).
-- `unhandledFilter` = Array.empty
This will probably fail for `key > 10` (p1 = 10), as we may not be able to
filter records during the scan and so return all of them; otherwise,
`buildScan` has to add an extra filter operation on the RDD[Row].
-- `unhandledFilter` = `key > p1`
We lose the optimization for partition (p1 = 100): the concrete filter there
is `key > 100`, so we should always return an empty RDD[Row], since the range
of key is (0, 50).
I mean it will be confusing for new data source developers how to define
`unhandledFilter`, as the partition key is not treated like a normal
attribute. At the least, it requires more work to get the concrete value, and
with multiple filters in the planning stage for different partition keys, what
is `unhandledFilter` supposed to return?
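
To make the two cases above concrete, here is a minimal sketch in plain Scala
with no Spark dependency. Everything in it is a hypothetical stand-in (the
`GreaterThan` case class, the object and method names, and the choice of
integer keys 1 to 49 for the range (0, 50)); it only shows how the same
`key > p1` predicate turns into a different concrete filter per partition.

```scala
object PartitionFilterSketch {
  // Hypothetical stand-in for a pushed-down filter; not a Spark class.
  final case class GreaterThan(attribute: String, value: Int)

  // Assumption: integer keys in the open range (0, 50), the same in every partition.
  val keys: Seq[Int] = 1 to 49

  // `key > p1` only becomes concrete once a partition value is picked.
  def concreteFilter(p1: Int): GreaterThan = GreaterThan("key", p1)

  // A scan that ignores the filter and returns every record.
  def scanIgnoringFilter(f: GreaterThan): Seq[Int] = keys

  // A scan that applies the filter itself, as `buildScan` would have to.
  def scanApplyingFilter(f: GreaterThan): Seq[Int] = keys.filter(_ > f.value)

  def main(args: Array[String]): Unit = {
    for (p1 <- Seq(10, 100)) {
      val f = concreteFilter(p1)
      println(s"p1=$p1 ignoring=${scanIgnoringFilter(f).size} " +
        s"applying=${scanApplyingFilter(f).size}")
    }
  }
}
```

For p1 = 10 the ignoring scan returns all 49 keys, including those not greater
than 10, so somebody still has to filter. For p1 = 100 the applying scan
returns nothing, which is the case where re-filtering afterwards is wasted
work.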
On the other hand, I am not sure it is really necessary to expose
`unhandledFilter`, as it will be a new API that data source developers have to
be aware of for optimization purposes, while we already pass down the filters
via the API `def buildScan(requiredColumns: Array[String], filters:
Array[Filter]): RDD[Row]` and its variants. Splitting the filter expressions
into 2 parts and executing them in different operators (DataSourceStrategy and
the DataSource implementation) seems to make things more complicated. We will
do the splitting inside the data source implementation anyway, but it is
probably not wise to expose that externally.
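
As a rough illustration of keeping the split internal, the sketch below models
a toy source in plain Scala. The `Filter` hierarchy, `Row` alias, sample rows,
and `canHandle` helper are all hypothetical and are not Spark classes; only
the shape of `buildScan(requiredColumns, filters)` follows the signature
quoted above, with `Seq` standing in for `RDD`.

```scala
object InternalSplitSketch {
  // Hypothetical filter types; not Spark's org.apache.spark.sql.sources classes.
  sealed trait Filter
  final case class GreaterThan(attribute: String, value: Int) extends Filter
  final case class EndsWith(attribute: String, suffix: String) extends Filter

  type Row = Map[String, Any]

  // Made-up sample data for the sketch.
  val rows: Seq[Row] = Seq(
    Map("key" -> 5, "value" -> "a"),
    Map("key" -> 20, "value" -> "b"),
    Map("key" -> 40, "value" -> "c"))

  // Private decision: which filters this toy source evaluates during the scan.
  private def canHandle(f: Filter): Boolean = f match {
    case GreaterThan("key", _) => true
    case _                     => false
  }

  def buildScan(requiredColumns: Array[String], filters: Array[Filter]): Seq[Row] = {
    // The handled/unhandled split stays inside the source; nothing is reported back.
    val (handled, _) = filters.partition(canHandle)
    rows
      .filter { row =>
        handled.forall {
          case GreaterThan(attr, v) =>
            row.get(attr) match {
              case Some(k: Int) => k > v
              case _            => false
            }
          case _ => true
        }
      }
      // Keep only the requested columns.
      .map(row => row.filter { case (name, _) => requiredColumns.contains(name) })
  }
}
```

In this sketch the caller never learns which filters were applied, so it would
have to re-evaluate all of them on the returned rows; that repeated work is
the cost of not exposing the split.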