rynewang opened a new pull request, #48716:
URL: https://github.com/apache/arrow/pull/48716
## Rationale for this change
Predicate pushdown for Parquet was broken for nullable columns with range statistics (min ≠ max), which describes the vast majority of real-world data. As a result, row groups were read even when predicates could definitively exclude them.
The root cause: Parquet statistics for nullable columns generate guarantees of the form:
or_(and_(field >= min, field <= max), is_null(field))
However, Inequality::ExtractOne() only handled a single comparison inside or_(..., is_null), not an and_(...) expression. As a result, no inequalities were extracted and SimplifyWithGuarantee() could not simplify predicates.
This affected all predicates on nullable columns:
- Comparisons: equal, less, greater, less_equal, greater_equal
- Set membership: is_in
See: https://github.com/apache/arrow/issues/36283
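
To make the intended behaviour concrete, here is a minimal Python sketch of the row-group decision that min/max statistics should allow. It is a toy model, not Arrow's implementation: the function names `can_skip_row_group` and `can_skip_is_in` are hypothetical, and it assumes that a null value never satisfies a comparison filter and that the `is_in` value set contains no nulls, so the presence of nulls does not prevent skipping.

```python
def can_skip_row_group(col_min, col_max, op, value):
    """Return True when no non-null value in [col_min, col_max] can satisfy `x <op> value`.

    Hypothetical helper for illustration only; nulls are assumed to be
    dropped by the filter, so they never force the row group to be read.
    """
    if op == "equal":
        return value < col_min or value > col_max
    if op == "less":
        return col_min >= value
    if op == "less_equal":
        return col_min > value
    if op == "greater":
        return col_max <= value
    if op == "greater_equal":
        return col_max < value
    raise ValueError(f"unsupported operator: {op}")


def can_skip_is_in(col_min, col_max, values):
    """Return True when every candidate value lies outside [col_min, col_max]."""
    return all(v < col_min or v > col_max for v in values)
```

For a row group with statistics min = 0 and max = 10, `can_skip_row_group(0, 10, "greater", 20)` reports that the row group can be skipped, whereas `can_skip_row_group(0, 10, "greater", 5)` does not.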
## What changes are included in this PR?
Added ExpandNullableRangeGuarantees(), which transforms:
or_(and_(A, B), is_null(x))
into:
[or_(A, is_null(x)), or_(B, is_null(x))]
This expansion is logically valid because (A ∧ B) ∨ C ≡ (A ∨ C) ∧ (B ∨ C), with the resulting list read as a conjunction of guarantees. Each expanded guarantee can then be processed by the existing simplification logic.
Also handles the reversed form or_(is_null(x), and_(...)).
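
The expansion can be sketched in a few lines of Python on a toy expression encoding. This is an illustration of the rewrite rule only, not the C++ code in this PR: expressions are modelled as plain tuples such as `("or", lhs, rhs)`, and the names `expand_nullable_range_guarantee` and `distributive_law_holds` are hypothetical. The sketch normalizes the reversed form so that `is_null` always ends up on the right, which is a choice made here for simplicity.

```python
from itertools import product


def expand_nullable_range_guarantee(guarantee):
    """Split or_(and_(A, B), is_null(x)) into [or_(A, is_null(x)), or_(B, is_null(x))].

    Expressions are toy tuples: ("or", a, b), ("and", a, b), ("is_null", field),
    or any other tuple treated as an opaque leaf. Guarantees that do not match
    the pattern are returned unchanged in a one-element list.
    """
    if guarantee[0] != "or":
        return [guarantee]
    lhs, rhs = guarantee[1], guarantee[2]
    # Accept the reversed form or_(is_null(x), and_(...)) by swapping operands.
    if lhs[0] == "is_null" and rhs[0] == "and":
        lhs, rhs = rhs, lhs
    if lhs[0] == "and" and rhs[0] == "is_null":
        return [("or", lhs[1], rhs), ("or", lhs[2], rhs)]
    return [guarantee]


def distributive_law_holds():
    """Check (A and B) or C == (A or C) and (B or C) over all Boolean inputs."""
    return all(
        ((a and b) or c) == ((a or c) and (b or c))
        for a, b, c in product([False, True], repeat=3)
    )
```

Each element of the returned list has the single-comparison shape `or_(comparison, is_null(x))`, which is the form the description above says the existing extraction logic already handled.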
## Are these changes tested?
Yes. Added two new test cases:
- SimplifyWithNullableRangeGuarantee - tests all comparison operators with nullable range guarantees
- SimplifyIsInWithNullableRangeGuarantee - tests is_in with nullable range guarantees
Both tests fail without the fix and pass with it.
## Are there any user-facing changes?
No API changes. Users will see improved query performance when filtering nullable columns in Parquet files, as row groups can now be correctly skipped based on min/max statistics.