GitHub user liancheng commented on the pull request:
https://github.com/apache/spark/pull/10362#issuecomment-165737899
@gatorsmile Discussed with @yhuai offline, and here's my two cents:
1. Correctness
The current master and 1.6 code is correct, but unfortunately it's
correct by accident rather than by design.
Essentially, the current Parquet and ORC filter push-down code is
flawed because it doesn't handle `Not` correctly. Take Parquet as an example:
the filter push-down code within parquet-mr has a separate phase for
eliminating `Not` (e.g., converting `!(a > 1)` to `a <= 1`). That's why it's
safe for parquet-mr to prune the data using only a single branch of an `And`
when the other branch is not applicable according to row group statistics
(min/max, etc.).
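To make this concrete, here is a minimal sketch of such a Not-elimination
pass. It's written against Spark SQL's public
`org.apache.spark.sql.sources.Filter` API purely for illustration; parquet-mr
performs the equivalent rewrite on its own `FilterPredicate` type:
```scala
import org.apache.spark.sql.sources._

// Push negations down to the leaves so no `Not` remains above `And`/`Or`:
// apply De Morgan's laws and flip comparison operators. Leaves without a
// direct complement (e.g. `Not(EqualTo(...))`) are left untouched here.
def eliminateNot(filter: Filter): Filter = filter match {
  case Not(And(l, r)) => Or(eliminateNot(Not(l)), eliminateNot(Not(r)))
  case Not(Or(l, r))  => And(eliminateNot(Not(l)), eliminateNot(Not(r)))
  case Not(Not(c))    => eliminateNot(c)
  case Not(GreaterThan(a, v))        => LessThanOrEqual(a, v)
  case Not(GreaterThanOrEqual(a, v)) => LessThan(a, v)
  case Not(LessThan(a, v))           => GreaterThanOrEqual(a, v)
  case Not(LessThanOrEqual(a, v))    => GreaterThan(a, v)
  case And(l, r) => And(eliminateNot(l), eliminateNot(r))
  case Or(l, r)  => Or(eliminateNot(l), eliminateNot(r))
  case other     => other
}
```
Once no `Not` can enclose an `And`, dropping one branch of a conjunction only
ever weakens the filter, which is exactly what makes single-branch pruning safe.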
As you mentioned, the current logic is quite tricky, which means it's
error-prone and hard to maintain. Here I tend to agree with @yhuai that we
should be more conservative for better maintainability. In the long run, CNF
conversion can be a promising solution.
2. Performance
Being conservative while dealing with `And` doesn't hurt performance in
most cases because of the way we deal with Parquet filter predicate
conversion. As you can see in `ParquetRelation`, the predicate is first
split into conjuncts, which are then converted individually. So a filter
predicate like
```
a > 1 AND b < 10 AND weird_udf(c)
```
is not affected: `a > 1` and `b < 10` are still pushed down. (Many
thanks to @yhuai; I didn't notice this at first.) But filter
predicates containing nested conjunction(s) are affected. E.g., nothing in the
following predicate can be pushed down:
```
a > 1 OR (b < 10 AND weird_udf(c))
```
(Note that this case can be fixed once we have CNF conversion.)
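For clarity, the splitting phase itself is essentially the following,
sketched over the `sources.Filter` API for brevity (in `ParquetRelation` the
split actually happens on Catalyst expressions, before translation to data
source filters):
```scala
import org.apache.spark.sql.sources._

// Split a predicate on top-level `And`s only. Conjunctions nested under an
// `Or` stay intact, so they must be converted (or rejected) as a whole.
def splitConjuncts(filter: Filter): Seq[Filter] = filter match {
  case And(left, right) => splitConjuncts(left) ++ splitConjuncts(right)
  case other            => Seq(other)
}
```
The first example above splits into three independent conjuncts, two of which
convert fine; the second is a single `Or` conjunct, so one unconvertible
sub-expression poisons the whole thing.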
However, we didn't do the same thing in `OrcRelation`, so in ORC, filter
predicates in both cases are affected. I'm working on a fix for this.
3. Conservative filtering strategy
We just added an `unhandledFilters` API for data sources, so that Spark
SQL doesn't perform the conservative filtering if the underlying data source
implements this method and tells Spark SQL which filters it can't handle.
The Parquet and ORC data sources in Spark SQL have already implemented this API.
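For reference, the contract is: return the subset of the given filters that
the source can't handle, and Spark SQL keeps evaluating exactly those on its
side. A hypothetical relation (the class name and schema below are made up
for illustration) might implement it like this:
```scala
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.sources._
import org.apache.spark.sql.types._

// Hypothetical relation that can fully evaluate simple comparisons itself.
// Everything returned from `unhandledFilters` is re-evaluated by Spark SQL.
class ComparisonOnlyRelation(val sqlContext: SQLContext) extends BaseRelation {
  override def schema: StructType = StructType(Seq(
    StructField("a", IntegerType),
    StructField("b", IntegerType)))

  override def unhandledFilters(filters: Array[Filter]): Array[Filter] =
    filters.filterNot {
      case _: EqualTo | _: GreaterThan | _: LessThan => true
      case _                                         => false
    }
}
```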
4. Follow-ups
- Backporting #5700 and #8716 to branch-1.5.
- Fixing the ORC filter push-down performance issue mentioned above.
(I'm working on it.)
- Adding CNF conversion to the optimizer, and explicitly guaranteeing
that all the filter predicates a data source receives are already in CNF.
(This also implies that data sources don't even need to handle `And` anymore;
see the sketch after this list.)
- Optionally, it would be good to have an optimization rule for
eliminating negations.
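As for the CNF conversion item, here is a rough sketch of what the core
rewrite could look like, again over `sources.Filter` for brevity (an actual
optimizer rule would operate on Catalyst expressions, and would need a size
cap, since distributing `Or` over `And` can blow up exponentially):
```scala
import org.apache.spark.sql.sources._

// Distribute `Or` over `And` until the result is a conjunction of
// disjunctions. Assumes negations were already pushed to the leaves
// (see the Not-elimination sketch in item 1).
def toCnf(filter: Filter): Filter = filter match {
  case And(l, r) => And(toCnf(l), toCnf(r))
  case Or(l, r) =>
    (toCnf(l), toCnf(r)) match {
      case (And(ll, lr), rhs) => And(toCnf(Or(ll, rhs)), toCnf(Or(lr, rhs)))
      case (lhs, And(rl, rr)) => And(toCnf(Or(lhs, rl)), toCnf(Or(lhs, rr)))
      case (lhs, rhs)         => Or(lhs, rhs)
    }
  case other => other
}
```
With this, `a > 1 OR (b < 10 AND weird_udf(c))` becomes
`(a > 1 OR b < 10) AND (a > 1 OR weird_udf(c))`, and the first conjunct can be
pushed down on its own.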