rynewang opened a new pull request, #48716:
URL: https://github.com/apache/arrow/pull/48716

   ##  Rationale for this change
   
   Predicate pushdown for Parquet was broken for nullable columns with range
   statistics (min ≠ max), which is the vast majority of real-world data. This
   caused row groups to be read even when predicates could definitively exclude
   them.
   
   The root cause: Parquet statistics for nullable columns generate guarantees
   of the form:
   or_(and_(field >= min, field <= max), is_null(field))
   
   However, Inequality::ExtractOne() only handled single comparisons inside
   or_(..., is_null), not and_(...) expressions. This meant no inequalities
   were extracted and SimplifyWithGuarantee() could not simplify predicates.
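
As a rough illustration of the gap described above, the sketch below models expressions as nested Python tuples. Every name in it (`field_cmp`, `extract_one`, and so on) is hypothetical and only imitates the behavior this description attributes to Inequality::ExtractOne(); it is not Arrow's C++ code.

```python
# Toy expression model: nested tuples, not Arrow's real Expression class.
def field_cmp(op, name, value):
    return ("cmp", op, name, value)

def is_null(name):
    return ("is_null", name)

def and_(a, b):
    return ("and", a, b)

def or_(a, b):
    return ("or", a, b)

def extract_one(guarantee):
    """Hypothetical stand-in for the pre-fix behavior: accept a bare
    comparison or or_(comparison, is_null(field)), and nothing else."""
    if guarantee[0] == "cmp":
        return {"cmp": guarantee, "nullable": False}
    if guarantee[0] == "or":
        lhs, rhs = guarantee[1], guarantee[2]
        # Only a single comparison on the same field is recognized here.
        if lhs[0] == "cmp" and rhs[0] == "is_null" and rhs[1] == lhs[2]:
            return {"cmp": lhs, "nullable": True}
    return None  # nothing usable could be extracted

single = or_(field_cmp(">=", "x", 0), is_null("x"))
ranged = or_(and_(field_cmp(">=", "x", 0), field_cmp("<=", "x", 50)),
             is_null("x"))

print(extract_one(single) is not None)  # True
print(extract_one(ranged) is not None)  # False
```

In this toy model the range guarantee yields nothing, so a simplifier built on `extract_one` would have no bounds to work with for that column.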
   
   This affected all predicates on nullable columns:
   - Comparisons: equal, less, greater, less_equal, greater_equal
   - Set membership: is_in
   
   See: https://github.com/apache/arrow/issues/36283
   
   ## What changes are included in this PR?
   
   Added ExpandNullableRangeGuarantees() which transforms:
   or_(and_(A, B), is_null(x))
   into:
   [or_(A, is_null(x)), or_(B, is_null(x))]
   
   This expansion is logically valid because (A ∧ B) ∨ C ≡ (A ∨ C) ∧ (B ∨ C),
   with the resulting list read as a conjunction of guarantees. Each expanded
   guarantee can then be processed by existing simplification logic.
   
   Also handles the reversed form or_(is_null(x), and_(...)).
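
The transformation can be sketched on a toy tuple representation as follows. This is a minimal illustration, not the PR's C++ implementation: `expand_nullable_range` and `evaluate` are hypothetical helpers, and the equivalence check uses ordinary two-valued logic over all truth assignments.

```python
from itertools import product

def expand_nullable_range(guarantee):
    """Split or_(and_(A, B), N) or or_(N, and_(A, B)) into
    [or_(A, N), or_(B, N)]; return anything else unchanged in a list."""
    if guarantee[0] != "or":
        return [guarantee]
    lhs, rhs = guarantee[1], guarantee[2]
    if lhs[0] == "is_null" and rhs[0] == "and":
        lhs, rhs = rhs, lhs  # normalize the reversed form
    if lhs[0] == "and" and rhs[0] == "is_null":
        return [("or", lhs[1], rhs), ("or", lhs[2], rhs)]
    return [guarantee]

def evaluate(expr, env):
    """Evaluate a toy expression; leaves are looked up in env."""
    if expr[0] == "and":
        return evaluate(expr[1], env) and evaluate(expr[2], env)
    if expr[0] == "or":
        return evaluate(expr[1], env) or evaluate(expr[2], env)
    return env[expr]

A, B, N = ("atom", "A"), ("atom", "B"), ("is_null", "x")
original = ("or", ("and", A, B), N)
reversed_form = ("or", N, ("and", A, B))
expanded = expand_nullable_range(original)

# The original guarantee holds exactly when every expanded guarantee holds.
for a, b, n in product([False, True], repeat=3):
    env = {A: a, B: b, N: n}
    assert evaluate(original, env) == all(evaluate(e, env) for e in expanded)

# The reversed form expands to the same list.
assert expand_nullable_range(reversed_form) == expanded
```

Each element of `expanded` has the single-comparison-or-null shape, which is the form the description says the existing logic already handles.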
   
   ## Are these changes tested?
   
   Yes. Added two new test cases:
   - SimplifyWithNullableRangeGuarantee - tests all comparison operators with
     nullable range guarantees
   - SimplifyIsInWithNullableRangeGuarantee - tests is_in with nullable range
     guarantees
   
   Both tests fail without the fix and pass with it.
   
   ## Are there any user-facing changes?
   
   No API changes. Users will see improved query performance when filtering
   nullable columns in Parquet files, as row groups can now be correctly
   skipped based on min/max statistics.
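
To make the skipping decision concrete, here is a small stdlib-only sketch of how min/max statistics can rule out a row group for the predicate kinds listed earlier. The function names are hypothetical, and the sketch assumes filter semantics in which null rows never satisfy a comparison or is_in predicate, so only the non-null range matters.

```python
def may_contain_match(op, value, stat_min, stat_max):
    """Can any non-null x in [stat_min, stat_max] satisfy the predicate?"""
    if op == "==":
        return stat_min <= value <= stat_max
    if op == "<":
        return stat_min < value
    if op == "<=":
        return stat_min <= value
    if op == ">":
        return stat_max > value
    if op == ">=":
        return stat_max >= value
    if op == "is_in":
        return any(stat_min <= v <= stat_max for v in value)
    raise ValueError(f"unsupported operator: {op}")

def can_skip_row_group(op, value, stat_min, stat_max):
    # Assumption: null rows are dropped by the filter, so a row group is
    # skippable as soon as no value in [stat_min, stat_max] can match.
    return not may_contain_match(op, value, stat_min, stat_max)

print(can_skip_row_group(">", 100, 0, 50))          # True
print(can_skip_row_group("<=", 10, 0, 50))          # False
print(can_skip_row_group("is_in", [60, 70], 0, 50)) # True
```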
   

