liujiwen-up opened a new pull request, #388:
URL: https://github.com/apache/paimon-rust/pull/388
### Purpose
Linked issue: none
This PR adds conservative predicate pushdown support for ORC reads.
Before this change, ORC reads ignored file predicates completely. This meant
ORC files could not benefit from reader-level row-group pruning, and filtering
on non-projected columns could fail once the ORC reader needed those columns
for predicate evaluation.
This change enables safe ORC row-group pruning for a first set of scalar
predicates while preserving Paimon's existing residual-filter semantics. ORC
predicate pushdown is treated as a conservative optimization only, not as exact
filtering.
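
To illustrate what "conservative optimization, not exact filtering" means, here is a minimal sketch that prunes row groups from min/max statistics and then re-applies the predicate row by row. The `RowGroupStats`, `GreaterThan`, and `scan` names are hypothetical and exist only for this example; they are not the `orc-rust` or Paimon APIs.

```rust
/// Hypothetical per-row-group statistics for a single i64 column.
#[derive(Debug, Clone, Copy)]
pub struct RowGroupStats {
    pub min: i64,
    pub max: i64,
}

/// Hypothetical predicate `column > threshold`.
#[derive(Debug, Clone, Copy)]
pub struct GreaterThan {
    pub threshold: i64,
}

impl GreaterThan {
    /// Pruning check: returns false only when the statistics prove that no
    /// row in the group can match. Missing statistics keep the group.
    pub fn may_match(&self, stats: Option<RowGroupStats>) -> bool {
        match stats {
            Some(s) => s.max > self.threshold,
            None => true,
        }
    }

    /// Exact row-level check, i.e. the residual filter above the scan.
    pub fn matches(&self, value: i64) -> bool {
        value > self.threshold
    }
}

/// Scan row groups: prune conservatively first, then filter exactly.
pub fn scan(groups: &[(Option<RowGroupStats>, Vec<i64>)], pred: GreaterThan) -> Vec<i64> {
    groups
        .iter()
        // Row-group pruning may keep groups that contain non-matching rows.
        .filter(|(stats, _)| pred.may_match(*stats))
        .flat_map(|(_, rows)| rows.iter().copied())
        // The residual filter restores exact semantics.
        .filter(|v| pred.matches(*v))
        .collect()
}
```

A surviving row group can still hold rows that fail the predicate, so dropping the final `filter` would return extra rows. This is why the residual filter has to stay in place.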
### Brief change log
- Translate supported Paimon predicates into `orc-rust` predicates for ORC
row-group pruning.
- Support ORC pushdown for:
  - boolean equality
  - tinyint/smallint/int/bigint comparisons
  - string comparisons
  - small `IN` predicates
  - `IS NOT NULL` on supported scalar types
- Treat unsupported predicates as fail-open: they are never pushed down and remain residual filters.
- Read predicate-only columns internally when they are not part of the
requested projection, then project output batches back to the requested columns.
- Document that ORC predicate pushdown is conservative and still requires
residual filtering above the scan for exact semantics.
- Add unit tests for predicate translation, fail-open behavior, nested
compound predicates, and projection restoration.
- Add integration coverage for ORC reads with predicate-only column
projection, supported scalar predicate types, conservative semantics, and
unsupported date predicates remaining residual.
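
The predicate-only column handling in the change log can be sketched as follows: widen the set of columns read so the predicate can be evaluated, then narrow each output batch back to the requested projection. This is a simplified illustration under stated assumptions. `Batch`, `columns_to_read`, and `project_back` are hypothetical names, and the real reader operates on Arrow record batches rather than this toy structure.

```rust
/// Hypothetical column-major batch with named i64 columns.
#[derive(Debug, Clone)]
pub struct Batch {
    pub names: Vec<String>,
    pub columns: Vec<Vec<i64>>,
}

/// Columns the reader must load: the requested projection first, followed
/// by any predicate column that is not already projected.
pub fn columns_to_read(projection: &[&str], predicate_columns: &[&str]) -> Vec<String> {
    let mut read: Vec<String> = projection.iter().map(|c| (*c).to_string()).collect();
    for col in predicate_columns {
        if !read.iter().any(|c| c.as_str() == *col) {
            read.push((*col).to_string());
        }
    }
    read
}

/// Restore the requested projection on an internally widened batch.
/// Returns `None` if a requested column is missing from the batch.
pub fn project_back(batch: &Batch, projection: &[&str]) -> Option<Batch> {
    let mut names = Vec::with_capacity(projection.len());
    let mut columns = Vec::with_capacity(projection.len());
    for wanted in projection {
        let idx = batch.names.iter().position(|n| n.as_str() == *wanted)?;
        names.push(batch.names[idx].clone());
        columns.push(batch.columns[idx].clone());
    }
    Some(Batch { names, columns })
}
```

Keeping the requested columns at the front of the read list means callers see the same column order whether or not a predicate added extra columns.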
### Unsupported in this PR
The following predicate pushdown cases are intentionally not enabled yet:
- `FLOAT = literal` and `DOUBLE = literal`
  - Reason: ORC equality pruning may use bloom filters. We need to verify
    that `orc-rust` float/double bloom hashing is compatible with ORC files
    produced by Spark/Java ORC writers before enabling this safely. A false
    negative bloom match could incorrectly skip a row group.
- Date and timestamp predicates
  - Reason: this first PR keeps the supported type surface small.
    Date/timestamp ORC statistics need separate validation for encoding, timezone,
    and unit semantics before enabling.
- Decimal, binary, nested types, and complex predicates
  - Reason: these require additional type-specific validation and test
    fixtures.
- `NOT`, `NOT IN`, `!=`, and partially supported predicates under `OR`
  - Reason: OR predicates must be pushed down only when all branches are
    safely representable. Otherwise row groups could be incorrectly pruned.
These unsupported cases fail open and remain residual filters. They can be
added incrementally in follow-up PRs after dedicated compatibility tests are
available.
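
The fail-open rule for compound predicates can be sketched as a translation function that returns `None` whenever pushdown would be unsafe. The `Predicate` and `Pushdown` enums below are hypothetical stand-ins, not the Paimon or `orc-rust` types. The `AND` handling (keep only the translatable conjuncts) is an assumption based on standard pruning logic; the description above only states the `OR` rule.

```rust
/// Hypothetical source predicate.
#[derive(Debug, Clone, PartialEq)]
pub enum Predicate {
    Eq(String, i64),
    NotEq(String, i64),
    And(Vec<Predicate>),
    Or(Vec<Predicate>),
    Not(Box<Predicate>),
}

/// Hypothetical pushdown predicate, restricted to supported forms.
#[derive(Debug, Clone, PartialEq)]
pub enum Pushdown {
    Eq(String, i64),
    And(Vec<Pushdown>),
    Or(Vec<Pushdown>),
}

/// Translate a predicate for row-group pruning. `None` means "fail open":
/// nothing is pushed down and the predicate stays a residual filter.
pub fn translate(pred: &Predicate) -> Option<Pushdown> {
    match pred {
        Predicate::Eq(col, v) => Some(Pushdown::Eq(col.clone(), *v)),
        // Unsupported forms are never pushed down.
        Predicate::NotEq(..) | Predicate::Not(_) => None,
        // Assumption: dropping a conjunct only weakens the filter, so the
        // translatable children of an AND can be pushed down on their own.
        Predicate::And(children) => {
            let kept: Vec<Pushdown> = children.iter().filter_map(translate).collect();
            if kept.is_empty() {
                None
            } else {
                Some(Pushdown::And(kept))
            }
        }
        // Dropping a branch of an OR would strengthen the filter and could
        // prune matching row groups, so every branch must translate.
        Predicate::Or(children) => {
            let all: Option<Vec<Pushdown>> = children.iter().map(translate).collect();
            all.filter(|v| !v.is_empty()).map(Pushdown::Or)
        }
    }
}
```

In this sketch `a = 1 AND b != 2` pushes down only `a = 1`, while `a = 1 OR b != 2` pushes down nothing. In both cases the full predicate is still evaluated above the scan.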
### Tests
- `cargo fmt --check`: passed
- `cargo test -p paimon arrow::format::orc::tests --lib`: passed
- `cargo test -p paimon-integration-tests test_read_orc --no-run`: compiled successfully (`--no-run` builds the tests without executing them)
Note: the ORC integration tests require the provisioned integration-test
warehouse to run fully. In this local environment, full execution was not run
because `default.full_types_table` was not present.
### API and Format
No public API changes.
No table format, snapshot, manifest, or persisted metadata format changes.
### Documentation
Inline reader documentation was updated to clarify that ORC predicate
pushdown is conservative row-group pruning, not exact row-level filtering.