GitHub user gatorsmile opened a pull request:

    https://github.com/apache/spark/pull/10278

    [SPARK-12218] [SQL] Fixed the Parquet's filter generation rule when `Not` 
is included in Parquet filter pushdown

    When handling the `Not` operator, the current generation rule for Parquet 
filters simply wraps whatever filter was generated for the underlying 
expression in `Not`, even when only part of that expression was converted. 
    
    For example, when the filter is ```"not (a = 2 and b in ('1', '2'))"```, 
the generated filter is ```not (a=2)```. When this filter is pushed down to 
Parquet, it wrongly removes rows that satisfy the original condition, namely 
those with ```a = 2``` and ```not(b in ('1', '2'))```.
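
    To make the loss concrete, here is a small standalone sketch in plain 
Python (not Spark or Parquet code). The rows are made up for illustration; only 
the two predicates come from the example above.

```python
# Hypothetical sample rows; the column names a and b follow the example above.
rows = [
    {"a": 2, "b": "1"},
    {"a": 2, "b": "3"},
    {"a": 5, "b": "1"},
    {"a": 5, "b": "3"},
]


def original_filter(row):
    # not (a = 2 and b in ('1', '2'))
    return not (row["a"] == 2 and row["b"] in ("1", "2"))


def pushed_down_filter(row):
    # not (a = 2): the filter the faulty rule generates
    return not (row["a"] == 2)


# Rows the query should return.
expected = [r for r in rows if original_filter(r)]

# Rows that survive the pushed-down filter.
survivors = [r for r in rows if pushed_down_filter(r)]

# Rows that the query should return but the pushed-down filter drops.
lost = [r for r in expected if r not in survivors]
print(lost)
```

    A row dropped at the Parquet level cannot be recovered by any later 
filtering step, so every entry in `lost` is missing from the final result.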
    
    In the current 1.6 branch, the Optimizer rule `BooleanSimplification` 
gained the following rewrites in PR https://github.com/apache/spark/pull/5700 
(BTW, should we move this to the analyzer?): 
    ```
            not(A and B) => not(A) or not(B)
            not(A or B) => not(A) and not(B)
    ```
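
    For readers who want to see the two rewrites in action, the following is 
a minimal Python sketch of them on a toy expression tree. The classes `Pred`, 
`And`, `Or`, `Not` and the function `push_not_down` are hypothetical names made 
up for this illustration; this is not the Catalyst implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pred:
    name: str


@dataclass(frozen=True)
class And:
    left: object
    right: object


@dataclass(frozen=True)
class Or:
    left: object
    right: object


@dataclass(frozen=True)
class Not:
    child: object


def push_not_down(expr):
    """Apply only the two rewrites listed above, recursively."""
    if isinstance(expr, Not):
        child = expr.child
        if isinstance(child, And):
            # not(A and B) => not(A) or not(B)
            return Or(push_not_down(Not(child.left)),
                      push_not_down(Not(child.right)))
        if isinstance(child, Or):
            # not(A or B) => not(A) and not(B)
            return And(push_not_down(Not(child.left)),
                       push_not_down(Not(child.right)))
        return expr  # nothing to rewrite below this Not
    if isinstance(expr, And):
        return And(push_not_down(expr.left), push_not_down(expr.right))
    if isinstance(expr, Or):
        return Or(push_not_down(expr.left), push_not_down(expr.right))
    return expr


rewritten = push_not_down(Not(And(Pred("A"), Pred("B"))))
print(rewritten)
```

    After this rewrite, `Not` only wraps leaf predicates, which is why a 
`Not` over a conjunction should normally not reach the Parquet filter 
generation at all.
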
    I do not think we should repeat this rewrite in the Parquet filter 
generation. Thus, I just added a condition to avoid incorrect results in case 
the Optimizer is unable to handle all the cases. 
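
    The kind of guard meant here can be sketched as follows. This is a 
simplified Python model, not the code in this PR: the tuple encoding, the 
`convert` function, and the `under_not` flag are all hypothetical. It assumes 
that `in` cannot be converted to a Parquet filter, and that a partially 
converted conjunction is acceptable outside `Not` only because the full 
predicate is evaluated again after the scan.

```python
def convert(expr, under_not=False):
    """Return a pushable filter, or None if expr must not be pushed down.

    Expressions are tuples such as ("eq", "a", 2), ("in", "b", ("1", "2")),
    ("and", l, r), ("or", l, r), and ("not", child).
    """
    op = expr[0]
    if op == "eq":
        return expr
    if op == "in":
        return None  # assumed unsupported in this sketch
    if op == "and":
        left = convert(expr[1], under_not)
        right = convert(expr[2], under_not)
        if left is not None and right is not None:
            return ("and", left, right)
        if under_not:
            # Negating a partial conjunction would drop valid rows.
            return None
        # Outside Not, a partial conjunction only keeps extra rows.
        return left if left is not None else right
    if op == "or":
        left = convert(expr[1], under_not)
        right = convert(expr[2], under_not)
        if left is None or right is None:
            return None
        return ("or", left, right)
    if op == "not":
        # Conservatively treat everything below a Not as "under Not".
        child = convert(expr[1], under_not=True)
        return None if child is None else ("not", child)
    return None


a_eq_2 = ("eq", "a", 2)
b_in = ("in", "b", ("1", "2"))

print(convert(("not", ("and", a_eq_2, b_in))))  # nothing is pushed down
print(convert(("and", a_eq_2, b_in)))           # only a = 2 is pushed down
```

    With such a guard, the filter from the example is simply not pushed 
down, so Parquet returns all rows and the full predicate is applied afterwards.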
    
    **Question**: how can we backport the PR 
https://github.com/apache/spark/pull/5700 to 1.5? Do you need me to submit a 
new PR for 1.5, or can you do it? This is a critical PR because the results 
will be incorrect without the fix.
    
    CC the original reviewers of https://github.com/apache/spark/pull/5700: 
@marmbrus @cloud-fan 
    
    Thanks!

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/gatorsmile/spark parquetFilterNot

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/10278.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #10278
    
----
commit 79be2c3581551ab24273f3da472269814d0d736e
Author: gatorsmile <[email protected]>
Date:   2015-12-12T18:10:16Z

    added a condition for `Not` operator in ParquetFilter.

----


