cbb330 opened a new issue, #49361:
URL: https://github.com/apache/arrow/issues/49361
### Summary
Part 2 of ORC predicate pushdown (#48986). Depends on #49360.
Add the dataset-layer infrastructure for ORC predicate pushdown, modeled
after `ParquetFileFragment`. This is the core of the feature — it connects
stripe statistics to Arrow's expression simplification engine to skip stripes
at scan time.
Note: The `OrcFileFragment` and stripe subsetting API overlap with #49288.
Coordinating there.
### Changes
**New class** `OrcFileFragment` (inheriting `FileFragment`) in `file_orc.h`:
```cpp
class OrcFileFragment : public FileFragment {
  std::optional<std::vector<int>> stripes_;
  std::vector<compute::Expression> statistics_expressions_;
  std::vector<bool> statistics_expressions_complete_;
  std::mutex physical_schema_mutex_;
};
```
**Key methods:**
- `DeriveFieldGuarantee(stripe_stats, field)`: Convert ORC stripe
statistics to an Arrow guarantee expression of the form
`(field >= min AND field <= max)`, extended with `OR is_null(field)` when the
stripe may contain nulls. Uses the adapter's `GetStripeColumnStatistics()` API.
- `TestStripes(predicate)`: For each stripe, build the guarantee expression
from statistics, then call `SimplifyWithGuarantee(predicate, guarantee)`.
Returns the per-stripe simplified expressions.
- `FilterStripes(predicate)`: Wrapper over `TestStripes()` that returns the
indices of the stripes whose simplified expression is still satisfiable
(`IsSatisfiable()` returns true).
- `Subset(stripe_ids)`: Create a new fragment representing only the
specified stripes (addresses #49288).
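
The pruning decision behind `FilterStripes()` can be illustrated without any Arrow dependency. The sketch below is not the proposed implementation: `StripeStats`, `MayContainGreaterThan`, and `FilterStripesGreaterThan` are hypothetical names, and the only predicate handled is `column > threshold`. It is meant to show the rule that a stripe is dropped only when its statistics prove that no row can match, and kept whenever statistics are missing.

```cpp
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical, simplified stand-in for one column's per-stripe statistics.
struct StripeStats {
  std::optional<int64_t> min;  // absent when no minimum was recorded
  std::optional<int64_t> max;  // absent when no maximum was recorded
};

// Returns false only when the statistics prove that `column > threshold`
// cannot hold for any row in the stripe.
inline bool MayContainGreaterThan(const StripeStats& stats, int64_t threshold) {
  if (!stats.max.has_value()) return true;  // missing statistics: keep the stripe
  return *stats.max > threshold;
}

// Returns the indices of the stripes that still have to be read.
inline std::vector<int> FilterStripesGreaterThan(
    const std::vector<StripeStats>& stripes, int64_t threshold) {
  std::vector<int> selected;
  for (int i = 0; i < static_cast<int>(stripes.size()); ++i) {
    if (MayContainGreaterThan(stripes[i], threshold)) selected.push_back(i);
  }
  return selected;
}
```

With four stripes covering `[0, 99]`, `[100, 199]`, no statistics, and `[200, 299]`, the predicate `column > 150` keeps stripes 1, 2, and 3. In the actual design this comparison would be delegated to `SimplifyWithGuarantee()`, so arbitrary predicates benefit from the same statistics.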
**Scanner integration:**
Wire `FilterStripes()` into `OrcFileFormat::ScanBatchesAsync()` so the scan
reads only stripes that pass the predicate. Uses `ORCFileReader::Seek()` +
`NextStripeReader()` for efficient streaming, rather than `ReadStripes()`,
which materializes the selected stripes into a Table.
**Infrastructure:**
- Lazy evaluation: only process statistics for fields referenced in the
predicate
- Statistics caching: avoid re-parsing stripe statistics on repeated access
- Thread safety: `physical_schema_mutex_` guards cached metadata (same
pattern as Parquet)
- Feature flag: `ARROW_ORC_DISABLE_PREDICATE_PUSHDOWN=1` to bypass filtering
- ORC column index mapping: Arrow field index → ORC depth-first pre-order
column ID (col 0 = root struct)
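
The column index mapping in the last item can be sketched as follows. This is an illustration only: `TypeNode`, `SubtreeSize`, and `TopLevelFieldToOrcColumnId` are hypothetical names, and the type tree is reduced to its shape. It relies on the numbering described above, where column 0 is the root struct and every node is numbered before its children.

```cpp
#include <vector>

// Hypothetical minimal type tree: only the nesting structure is modeled.
struct TypeNode {
  std::vector<TypeNode> children;
};

// Number of ORC column IDs a node occupies: itself plus all descendants.
inline int SubtreeSize(const TypeNode& node) {
  int size = 1;
  for (const auto& child : node.children) size += SubtreeSize(child);
  return size;
}

// Maps a top-level Arrow field index to its ORC column ID under
// depth-first pre-order numbering.
inline int TopLevelFieldToOrcColumnId(const TypeNode& root, int field_index) {
  int column_id = 1;  // column 0 is the root struct itself
  for (int i = 0; i < field_index; ++i) {
    column_id += SubtreeSize(root.children[i]);  // skip earlier fields' subtrees
  }
  return column_id;
}
```

For a schema with fields `a` (primitive), `b` (struct with two primitive children), and `c` (primitive), the resulting IDs are 1, 2, and 5, because `b`'s children take IDs 3 and 4. For a flat schema the mapping reduces to `field_index + 1`.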
**Initial type support:** INT32, INT64
- INT32 overflow protection: if liborc int64 stats exceed INT32 bounds,
conservatively include the stripe
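
A minimal sketch of the overflow check, assuming the statistics arrive as `int64_t` values: `NarrowToInt32Bounds` is a hypothetical helper, not part of the adapter. Returning an empty optional stands for "no usable guarantee", which leads to the stripe being included. The additional `min > max` check is an extra assumption made here to treat inconsistent statistics the same way.

```cpp
#include <cstdint>
#include <limits>
#include <optional>
#include <utility>

// Narrows int64 statistics to int32 bounds. Returns std::nullopt when either
// bound does not fit in int32 (or the bounds are inconsistent), so the caller
// derives no guarantee and conservatively keeps the stripe.
inline std::optional<std::pair<int32_t, int32_t>> NarrowToInt32Bounds(
    int64_t min, int64_t max) {
  constexpr int64_t kLo = std::numeric_limits<int32_t>::min();
  constexpr int64_t kHi = std::numeric_limits<int32_t>::max();
  if (min > max) return std::nullopt;  // inconsistent statistics (assumption)
  if (min < kLo || max > kHi) return std::nullopt;  // out of int32 range
  return std::make_pair(static_cast<int32_t>(min), static_cast<int32_t>(max));
}
```

A caller would build the `field >= min AND field <= max` guarantee only when the result has a value.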
### Tests
Dataset-layer tests in `file_orc_test.cc`:
- Stripe filtering with selective predicates (verify correct stripes
skipped/kept)
- Greater-than predicate on multi-stripe file
- Feature flag disables pushdown
- Missing statistics → conservative include
- INT32 overflow → conservative include
- Multi-stripe file with known per-stripe value ranges
- Explicit stripe selection via `Subset()`
### Component(s)
C++