GitHub user gatorsmile commented on the pull request:
https://github.com/apache/spark/pull/10362#issuecomment-165611857
@yhuai Based on my understanding, our current data source filtering strategy
is very conservative: we do the filtering twice. We let the data sources apply
the filters first, and then Spark applies them again.
For example, given a filter `A or (B AND C)`, if the data source is unable to
process `C`, we still push the filter down; what actually gets pushed is the
weaker filter `A or B`. Spark then re-applies the original filter to ensure
the result is correct. I think the current strategy will still improve
performance in most cases if the data source supports indexing.
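Here is a minimal sketch of that weakening step, using a toy predicate ADT
(the names `Pred`, `Leaf`, and `relax` are made up for illustration and are
not Spark's actual `sources.Filter` API). The invariant is that the pushed-down
filter only ever accepts a superset of the rows the original accepts, which is
safe because Spark re-filters afterwards:
```scala
sealed trait Pred
case class Leaf(name: String) extends Pred          // an atomic predicate, e.g. a = 1
case class And(l: Pred, r: Pred) extends Pred
case class Or(l: Pred, r: Pred) extends Pred
case object True extends Pred                       // "no filtering": always safe to push

// Weaken p so that every leaf the source cannot handle is replaced by True.
def relax(p: Pred, supported: Leaf => Boolean): Pred = p match {
  case l: Leaf => if (supported(l)) l else True
  case And(l, r) => (relax(l, supported), relax(r, supported)) match {
    case (True, rr) => rr                           // dropping a conjunct only weakens the filter
    case (ll, True) => ll
    case (ll, rr)   => And(ll, rr)
  }
  case Or(l, r) => (relax(l, supported), relax(r, supported)) match {
    case (True, _) | (_, True) => True              // an unfilterable branch makes the Or filter nothing
    case (ll, rr)              => Or(ll, rr)
  }
  case True => True
}

// The example above: A or (B AND C), where the source cannot handle C.
val pushed = relax(Or(Leaf("A"), And(Leaf("B"), Leaf("C"))), _.name != "C")
// pushed == Or(Leaf("A"), Leaf("B")), i.e. effectively A or B.
```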
In this JIRA, the root cause is that we failed to process `Not` correctly. In
the original code, the logic was effectively:
```
not(A and B) => not(A) and not(B)
not(A or B) => not(A) or not(B)
```
The above logic is wrong: by De Morgan's laws, `not(A and B)` should become
`not(A) or not(B)`, and `not(A or B)` should become `not(A) and not(B)`. In
particular, the first rewrite pushes down a filter that is stricter than the
original, so rows are dropped at the source and we are unable to get the
correct result.
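To make the failure concrete, here is a small self-contained check (plain
Scala; the truth values are hypothetical) showing that the buggy `And` rewrite
is stricter than the original and drops a row that should be kept:
```scala
val (a, b) = (true, false)
val original = !(a && b)   // not(A and B) = true: the row satisfies the filter
val buggy    = !a && !b    // buggy rewrite = false: the row is dropped at the source
val correct  = !a || !b    // De Morgan's rewrite = true: agrees with the original
assert(original == correct && original != buggy)
```
Because the row is already dropped at the data source, Spark's second
filtering pass cannot bring it back, which is why the result is incorrect.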
@liancheng Please correct me if my understanding is wrong.