Github user liancheng commented on the pull request:

    https://github.com/apache/spark/pull/10362#issuecomment-165737899
  
    @gatorsmile Discussed with @yhuai offline, and here's my two cents:
    
    1.  Correctness
    
        The current master and 1.6 code is correct, but unfortunately it's correct by accident rather than by design.
    
        Essentially, the current Parquet and ORC filter push-down code is flawed because it doesn't handle `Not` correctly. Take Parquet as an example: the filter push-down code within parquet-mr has a separate phase for eliminating `Not` (e.g., converting `!(a > 1)` to `a <= 1`). That's why it's safe for parquet-mr to prune data using only a single branch of an `And` when the other branch is not applicable according to row group statistics (min/max, etc.).
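
        To make that `Not`-elimination phase concrete, here is a minimal sketch of such a pass over a hypothetical mini predicate AST (the class names are illustrative, not the actual parquet-mr or Catalyst classes):

        ```
        // Toy predicate AST, for illustration only.
        sealed trait Pred
        case class Gt(col: String, v: Int) extends Pred
        case class LtEq(col: String, v: Int) extends Pred
        case class And(l: Pred, r: Pred) extends Pred
        case class Or(l: Pred, r: Pred) extends Pred
        case class Not(p: Pred) extends Pred

        // Push negations inward until none remain.
        def eliminateNot(p: Pred): Pred = p match {
          case Not(Gt(c, v))   => LtEq(c, v)  // !(a > 1)  ==>  a <= 1
          case Not(LtEq(c, v)) => Gt(c, v)    // !(a <= 1) ==>  a > 1
          case Not(And(l, r))  => Or(eliminateNot(Not(l)), eliminateNot(Not(r)))  // De Morgan
          case Not(Or(l, r))   => And(eliminateNot(Not(l)), eliminateNot(Not(r))) // De Morgan
          case Not(Not(q))     => eliminateNot(q)
          case And(l, r)       => And(eliminateNot(l), eliminateNot(r))
          case Or(l, r)        => Or(eliminateNot(l), eliminateNot(r))
          case leaf            => leaf
        }
        ```

        Once every `Not` is gone, pruning on a single convertible branch of an `And` is sound, because dropping a conjunct only makes the filter less selective.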
    
        As you mentioned, the current logic is quite tricky, which implies that it's error-prone and hard to maintain. Here I tend to agree with @yhuai that we should be more conservative for better maintainability. In the long run, CNF conversion can be a promising solution.
    
    2.  Performance
    
        Being conservative while dealing with `And` doesn't hurt performance in most cases because of the way we deal with Parquet filter predicate conversion. You may see that, in `ParquetRelation`, the predicate is first split into conjunctive terms, and then each term is converted individually. So a filter predicate like
    
        ```
        a > 1 AND b < 10 AND weird_udf(c)
        ```
    
        is not affected: `a > 1` and `b < 10` are still pushed down. (Many thanks to @yhuai; I actually didn't notice this at first.) But filter predicates containing nested conjunction(s) are affected. E.g., nothing in the following predicate can be pushed down:
    
        ```
        a > 1 OR (b < 10 AND weird_udf(c))
        ```
    
        (Note that this case can be fixed once we have CNF conversion; see the sketch at the end of this item.)
    
        However, we didn't do the same thing in `OrcRelation`, so filter predicates in both cases are affected for ORC. I'm working on a fix for this.
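
        Here is a rough sketch of the split-then-convert idea described above, using a toy expression type rather than Spark's actual `ParquetRelation` / `ParquetFilters` code:

        ```
        // Toy expression type, for illustration only.
        sealed trait Expr
        case class Cmp(repr: String) extends Expr  // a convertible comparison
        case class Udf(repr: String) extends Expr  // something we can't push down
        case class AndE(l: Expr, r: Expr) extends Expr
        case class OrE(l: Expr, r: Expr) extends Expr

        // Split only the top-level conjunction into individual terms.
        def splitConjuncts(e: Expr): Seq[Expr] = e match {
          case AndE(l, r) => splitConjuncts(l) ++ splitConjuncts(r)
          case other      => Seq(other)
        }

        // Convert one term into a pushed-down filter; None if any part is unsupported.
        def convert(e: Expr): Option[String] = e match {
          case Cmp(s)     => Some(s)
          case Udf(_)     => None
          case AndE(l, r) => for (a <- convert(l); b <- convert(r)) yield s"($a AND $b)"
          case OrE(l, r)  => for (a <- convert(l); b <- convert(r)) yield s"($a OR $b)"
        }

        // Top-level AND: each conjunct is converted independently, so the two
        // comparisons are still pushed down even though the UDF term is dropped.
        val top = AndE(AndE(Cmp("a > 1"), Cmp("b < 10")), Udf("weird_udf(c)"))
        assert(splitConjuncts(top).flatMap(convert) == Seq("a > 1", "b < 10"))

        // Nested conjunction under OR: conversion fails as a whole, so nothing
        // is pushed down.
        val nested = OrE(Cmp("a > 1"), AndE(Cmp("b < 10"), Udf("weird_udf(c)")))
        assert(splitConjuncts(nested).flatMap(convert).isEmpty)
        ```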
    
    3.  Conservative filtering strategy
    
        We just added an `unhandledFilters` API for data sources, so that Spark SQL doesn't perform the conservative filtering if the underlying data source implements this method and tells Spark SQL which filters it can't handle. The Parquet and ORC data sources in Spark SQL have already implemented this API.
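
        For reference, here is a hedged sketch of how a data source might implement `unhandledFilters` (the relation class and the "handle only equality and greater-than" policy below are made up for illustration; only the API itself comes from `org.apache.spark.sql.sources`):

        ```
        import org.apache.spark.sql.sources.{BaseRelation, EqualTo, Filter, GreaterThan}

        // Hypothetical relation; the push-down policy below is illustrative only.
        abstract class MyRelation extends BaseRelation {
          // Return the subset of `filters` this source cannot evaluate itself;
          // Spark SQL will re-apply exactly those filters on top of the scan.
          override def unhandledFilters(filters: Array[Filter]): Array[Filter] =
            filters.filterNot {
              case _: EqualTo | _: GreaterThan => true  // pretend we push these down
              case _                           => false
            }
        }
        ```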
    
    4.  Follow-ups
    
        - Backporting #5700 and #8716 to branch-1.5.
        - Fixing the ORC filter push-down performance issue mentioned above. (I'm working on it.)
        - Adding CNF conversion to the optimizer, and explicitly guaranteeing that all the filter predicates a data source receives are already in CNF. (This also implies that data sources don't even need to handle `And` anymore. A rough sketch follows this list.)
        - Optionally, it would be good to have an optimization rule for eliminating negations.
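
        For the CNF follow-up, here is an illustrative conversion over a toy predicate type (assuming negations have already been eliminated, so we only need to distribute `Or` over `And`; none of these names are real Spark classes):

        ```
        // Toy predicate type, for illustration only.
        sealed trait P
        case class Leaf(s: String) extends P
        case class Conj(l: P, r: P) extends P
        case class Disj(l: P, r: P) extends P

        def toCnf(p: P): P = p match {
          case Conj(l, r) => Conj(toCnf(l), toCnf(r))
          case Disj(l, r) =>
            (toCnf(l), toCnf(r)) match {
              // Distribute OR over AND: x OR (y AND z) ==> (x OR y) AND (x OR z)
              case (Conj(a, b), c) => Conj(toCnf(Disj(a, c)), toCnf(Disj(b, c)))
              case (a, Conj(b, c)) => Conj(toCnf(Disj(a, b)), toCnf(Disj(a, c)))
              case (a, b)          => Disj(a, b)
            }
          case leaf => leaf
        }

        // a > 1 OR (b < 10 AND weird_udf(c))
        //   ==>  (a > 1 OR b < 10) AND (a > 1 OR weird_udf(c)),
        // so the `a > 1 OR b < 10` conjunct becomes pushable on its own.
        val e = Disj(Leaf("a > 1"), Conj(Leaf("b < 10"), Leaf("weird_udf(c)")))
        println(toCnf(e))
        ```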


