GitHub user gatorsmile opened a pull request:
https://github.com/apache/spark/pull/10278
[SPARK-12218] [SQL] Fixed the Parquet's filter generation rule when `Not`
is included in Parquet filter pushdown
When applying the operator `Not`, the current generation rule for Parquet
filters simply applies `Not` to all the underlying (child) filters.
For example, when the filter is ```"not (a = 2 and b in ('1', '2'))"```,
the `in` predicate is dropped and the generated filter is ```not (a=2)```.
When we push down this filter to Parquet, it removes every row with `a = 2`,
including the eligible rows that satisfy the condition ```not(b in ('1', '2'))```.
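To make the wrong result concrete, here is a minimal sketch in plain Python (not Spark or Parquet code). The four sample rows and the helper names `original` and `pushed_down` are made up for illustration; only the two predicates come from the example above.

```python
# Hypothetical sample rows; columns a and b mirror the example filter.
rows = [
    {"a": 2, "b": "1"},
    {"a": 2, "b": "3"},
    {"a": 5, "b": "1"},
    {"a": 5, "b": "3"},
]

def original(row):
    # The filter the user wrote: not (a = 2 and b in ('1', '2'))
    return not (row["a"] == 2 and row["b"] in ("1", "2"))

def pushed_down(row):
    # The filter that was generated: not (a = 2)
    return not (row["a"] == 2)

expected = [r for r in rows if original(r)]
surviving = [r for r in rows if pushed_down(r)]

# Rows the query should return but that the pushed-down filter discards.
lost = [r for r in expected if r not in surviving]
print(lost)
```

The row with `a = 2` and `b = '3'` satisfies the original filter but never reaches the query, because the pushed-down filter is stricter than the original one.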
In the current 1.6, the PR https://github.com/apache/spark/pull/5700 added the
following new rules to the Optimizer's rule `BooleanSimplification`: (BTW,
should we move this to the analyzer?)
```
not(A and B) => not(A) or not(B)
not(A or B) => not(A) and not(B)
```
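As a rough illustration of these two rewrites, the sketch below applies them to a toy expression tree in Python. The classes `Var`, `Not`, `And`, `Or` and the functions `simplify` and `evaluate` are hypothetical stand-ins, not Catalyst's actual classes or the code of `BooleanSimplification`.

```python
from dataclasses import dataclass
from itertools import product

@dataclass(frozen=True)
class Var:
    name: str

@dataclass(frozen=True)
class Not:
    child: object

@dataclass(frozen=True)
class And:
    left: object
    right: object

@dataclass(frozen=True)
class Or:
    left: object
    right: object

def simplify(e):
    """Push Not below And/Or using the two rules listed above."""
    if isinstance(e, Not):
        c = e.child
        if isinstance(c, And):   # not(A and B) => not(A) or not(B)
            return Or(simplify(Not(c.left)), simplify(Not(c.right)))
        if isinstance(c, Or):    # not(A or B) => not(A) and not(B)
            return And(simplify(Not(c.left)), simplify(Not(c.right)))
        return Not(simplify(c))
    if isinstance(e, (And, Or)):
        return type(e)(simplify(e.left), simplify(e.right))
    return e

def evaluate(e, env):
    """Evaluate an expression against a dict of variable values."""
    if isinstance(e, Var):
        return env[e.name]
    if isinstance(e, Not):
        return not evaluate(e.child, env)
    if isinstance(e, And):
        return evaluate(e.left, env) and evaluate(e.right, env)
    return evaluate(e.left, env) or evaluate(e.right, env)

expr = Not(And(Var("A"), Var("B")))
rewritten = simplify(expr)

# Check that the rewrite preserves the truth table.
equivalent = all(
    evaluate(expr, {"A": a, "B": b}) == evaluate(rewritten, {"A": a, "B": b})
    for a, b in product([False, True], repeat=2)
)
print(rewritten, equivalent)
```

Once `Not` sits directly on the leaf predicates, a data source that cannot convert one leaf no longer ends up negating a partially converted conjunction.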
I do not think we should redo it in the Parquet filter generation. Thus, I
just added a condition to avoid the incorrect results in case the Optimizer is
unable to handle all the cases.
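One way such a condition could look is sketched below in Python. This is not the code in this PR: the tuple encoding of predicates, the `convert` function, and its `partial_ok` flag are all assumptions made for illustration, and `in` is treated as the one predicate that cannot be converted.

```python
def convert(pred, partial_ok=True):
    """Return a pushable filter, or None if nothing can safely be pushed.

    Predicates are hypothetical tuples such as ("eq", "a", 2),
    ("in", "b", ("1", "2")), ("and", p, q) and ("not", p).
    """
    kind = pred[0]
    if kind == "eq":
        return pred
    if kind == "in":
        # Assumed to be unsupported by the target filter API.
        return None
    if kind == "and":
        left = convert(pred[1], partial_ok)
        right = convert(pred[2], partial_ok)
        if left is not None and right is not None:
            return ("and", left, right)
        # Keeping one side of a conjunction only loosens the filter,
        # which is safe unless the result is later negated.
        if partial_ok:
            return left if left is not None else right
        return None
    if kind == "not":
        # The guard: only negate a child that was converted in full.
        child = convert(pred[1], partial_ok=False)
        return ("not", child) if child is not None else None
    return None

a_eq_2 = ("eq", "a", 2)
b_in = ("in", "b", ("1", "2"))

print(convert(("not", ("and", a_eq_2, b_in))))  # nothing is pushed down
print(convert(("and", a_eq_2, b_in)))           # the convertible side only
print(convert(("not", a_eq_2)))                 # fully convertible
```

Returning nothing for the negated conjunction means Parquet reads more rows than strictly necessary, but the full filter is still evaluated afterwards, so the result stays correct.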
**Question**: How can we include the PR
https://github.com/apache/spark/pull/5700 in 1.5? Do you need me to submit a
new PR for 1.5, or can you do it? This is a critical PR because the result will
be incorrect without the fix.
CC the original reviewers of https://github.com/apache/spark/pull/5700:
@marmbrus @cloud-fan
Thanks!
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/gatorsmile/spark parquetFilterNot
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/10278.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #10278
----
commit 79be2c3581551ab24273f3da472269814d0d736e
Author: gatorsmile <[email protected]>
Date: 2015-12-12T18:10:16Z
added a condition for `Not` operator in ParquetFilter.
----