liujiwen-up opened a new pull request, #388:
URL: https://github.com/apache/paimon-rust/pull/388

   ### Purpose
   
   Linked issue: none
   
   This PR adds conservative predicate pushdown support for ORC reads.
   
   Before this change, ORC reads ignored file predicates completely. This meant 
ORC files could not benefit from reader-level row-group pruning, and filtering 
on non-projected columns could fail once the ORC reader needed those columns 
for predicate evaluation.
   
   This change enables safe ORC row-group pruning for a first set of scalar 
predicates while preserving Paimon's existing residual-filter semantics. ORC 
predicate pushdown is treated as a conservative optimization only, not as exact 
filtering.
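
To illustrate what "conservative, not exact" means here, the following is a minimal sketch and not the code in this PR. `RowGroupStats`, `EqualTo`, and `scan` are hypothetical names, and row groups are modeled as plain `Vec<i64>` values rather than ORC stripes. Pruning may only skip a row group when the statistics prove no row can match; every surviving row still goes through the residual filter.

```rust
/// Hypothetical min/max statistics for one row group of an i64 column.
#[derive(Debug, Clone, Copy)]
struct RowGroupStats {
    min: i64,
    max: i64,
}

/// Hypothetical predicate: `column = value`.
#[derive(Debug, Clone, Copy)]
struct EqualTo {
    value: i64,
}

impl EqualTo {
    /// Pruning check: returns `false` only when no row in the group can match.
    /// A `true` result does not promise that any row actually matches.
    fn may_match(&self, stats: &RowGroupStats) -> bool {
        stats.min <= self.value && self.value <= stats.max
    }

    /// Exact row-level check, applied by the residual filter.
    fn matches(&self, row: i64) -> bool {
        row == self.value
    }
}

/// Returns (number of row groups read, rows that pass the residual filter).
fn scan(row_groups: &[Vec<i64>], pred: EqualTo) -> (usize, Vec<i64>) {
    let mut groups_read = 0;
    let mut out = Vec::new();
    for rows in row_groups {
        // Derive the statistics on the fly; a real file stores them in its index.
        let (Some(&min), Some(&max)) = (rows.iter().min(), rows.iter().max()) else {
            continue; // empty group: nothing to read
        };
        if !pred.may_match(&RowGroupStats { min, max }) {
            continue; // safe to skip: the value lies outside [min, max]
        }
        groups_read += 1;
        // Residual filter: restores exact semantics for the rows that were read.
        out.extend(rows.iter().copied().filter(|row| pred.matches(*row)));
    }
    (groups_read, out)
}
```

With row groups `[1, 2, 3]`, `[10, 20, 30]`, and `[100, 200]`, a predicate `= 15` still reads the second group because 15 lies inside its min/max range, and only the residual filter removes the non-matching rows.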
   
   ### Brief change log
   
   - Translate supported Paimon predicates into `orc-rust` predicates for ORC 
row-group pruning.
   - Support ORC pushdown for:
     - boolean equality
     - tinyint/smallint/int/bigint comparisons
     - string comparisons
     - small `IN` predicates
     - `IS NOT NULL` on supported scalar types
   - Keep unsupported predicates fail-open: they are never pushed down and stay as residual filters.
   - Read predicate-only columns internally when they are not part of the 
requested projection, then project output batches back to the requested columns.
   - Document that ORC predicate pushdown is conservative and still requires 
residual filtering above the scan for exact semantics.
   - Add unit tests for predicate translation, fail-open behavior, nested 
compound predicates, and projection restoration.
   - Add integration coverage for ORC reads with predicate-only column 
projection, supported scalar predicate types, conservative semantics, and 
unsupported date predicates remaining residual.
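
The handling of predicate-only columns in the change log above can be sketched as follows. This is an illustration under assumptions, not the PR's implementation: `widen_projection` and `restore_projection` are hypothetical helpers, columns are identified by index, and a batch is modeled as a `Vec` with one entry per column. The sketch appends predicate-only columns after the requested ones, so restoring the requested projection is a truncation.

```rust
/// Build the list of columns the reader must load: the requested projection
/// first, followed by any predicate columns that were not requested.
fn widen_projection(requested: &[usize], predicate_cols: &[usize]) -> Vec<usize> {
    let mut read_cols = requested.to_vec();
    for &col in predicate_cols {
        if !read_cols.contains(&col) {
            read_cols.push(col);
        }
    }
    read_cols
}

/// Project a batch back to the requested columns. Because the extra columns
/// were appended at the end, dropping them is a truncation.
fn restore_projection<T>(mut batch: Vec<T>, requested_len: usize) -> Vec<T> {
    batch.truncate(requested_len);
    batch
}
```

For a requested projection `[2, 0]` and a predicate on columns `[0, 3]`, the reader would load `[2, 0, 3]` and the caller would still receive only the first two columns.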
   
   ### Unsupported in this PR
   
   The following predicate pushdown cases are intentionally not enabled yet:
   
   - `FLOAT = literal` and `DOUBLE = literal`
     - Reason: ORC equality pruning may use bloom filters. We need to verify 
that `orc-rust` float/double bloom hashing is compatible with ORC files 
produced by Spark/Java ORC writers before enabling this safely. A false 
negative bloom match could incorrectly skip a row group.
   - Date and timestamp predicates
     - Reason: this first PR keeps the supported type surface small. 
Date/timestamp ORC statistics need separate validation for encoding, timezone, 
and unit semantics before enabling.
   - Decimal, binary, nested types, and complex predicates
     - Reason: these require additional type-specific validation and test 
fixtures.
   - `NOT`, `NOT IN`, `!=`, and partially supported predicates under `OR`
     - Reason: OR predicates must be pushed down only when all branches are 
safely representable. Otherwise row groups could be incorrectly pruned.
   
   These unsupported cases fail open and remain residual filters. They can be 
added incrementally in follow-up PRs after dedicated compatibility tests are 
available.
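
The fail-open rule for compound predicates can be sketched as below. `Predicate`, `Pushed`, and `translate` are hypothetical and do not mirror the Paimon or `orc-rust` types. The `OR` rule follows the reasoning above; dropping untranslatable conjuncts under `AND` is logically safe because it only widens the pushed-down predicate, but whether this PR does so is an assumption of the sketch.

```rust
/// Hypothetical engine-side predicate.
#[derive(Debug, Clone, PartialEq)]
enum Predicate {
    IntEq(String, i64),
    DoubleEq(String, f64), // not pushed down in this sketch
    Not(Box<Predicate>),   // not pushed down in this sketch
    And(Vec<Predicate>),
    Or(Vec<Predicate>),
}

/// Hypothetical reader-side predicate used only for row-group pruning.
#[derive(Debug, Clone, PartialEq)]
enum Pushed {
    IntEq(String, i64),
    And(Vec<Pushed>),
    Or(Vec<Pushed>),
}

/// Returns `None` when nothing can be pushed down safely (fail open).
/// The caller keeps the full original predicate as a residual filter either way.
fn translate(p: &Predicate) -> Option<Pushed> {
    match p {
        Predicate::IntEq(col, v) => Some(Pushed::IntEq(col.clone(), *v)),
        Predicate::DoubleEq(..) | Predicate::Not(_) => None,
        Predicate::And(children) => {
            // Dropping a conjunct only makes the pushed predicate less selective.
            let kept: Vec<Pushed> = children.iter().filter_map(translate).collect();
            match kept.len() {
                0 => None,
                1 => kept.into_iter().next(),
                _ => Some(Pushed::And(kept)),
            }
        }
        Predicate::Or(children) => {
            if children.is_empty() {
                return None;
            }
            // Every branch must translate; dropping one could prune matching rows.
            children
                .iter()
                .map(translate)
                .collect::<Option<Vec<Pushed>>>()
                .map(Pushed::Or)
        }
    }
}
```

Under these rules `a = 1 AND d = 1.5` pushes down only `a = 1`, while `a = 1 OR d = 1.5` pushes down nothing, since pruning on `a = 1` alone could skip row groups whose rows match `d = 1.5`.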
   
   ### Tests
   
   - `cargo fmt --check`: passed
   - `cargo test -p paimon arrow::format::orc::tests --lib`: passed
   - `cargo test -p paimon-integration-tests test_read_orc --no-run`: passed (compile only; `--no-run` builds the tests without executing them)
   
   Note: the ORC integration tests need the provisioned integration-test 
warehouse to run. They were not executed in this local environment because 
`default.full_types_table` was not present.
   
   ### API and Format
   
   No public API changes.
   
   No table format, snapshot, manifest, or persisted metadata format changes.
   
   ### Documentation
   
   Inline reader documentation was updated to clarify that ORC predicate 
pushdown is conservative row-group pruning, not exact row-level filtering.
   

