cbb330 opened a new issue, #49361:
URL: https://github.com/apache/arrow/issues/49361

   ### Summary
   
   Part 2 of ORC predicate pushdown (#48986). Depends on #49360.
   
   Add the dataset-layer infrastructure for ORC predicate pushdown, modeled 
after `ParquetFileFragment`. This is the core of the feature — it connects 
stripe statistics to Arrow's expression simplification engine to skip stripes 
at scan time.
   
   Note: The `OrcFileFragment` and stripe subsetting API overlap with #49288. 
Coordinating there.
   
   ### Changes
   
   **New class** `OrcFileFragment` (inheriting `FileFragment`) in `file_orc.h`:
   
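   ```cpp
   class OrcFileFragment : public FileFragment {
       // Selected stripe indices (analogous to row_groups_ in ParquetFileFragment).
       std::optional<std::vector<int>> stripes_;
       // Guarantee expressions derived from stripe statistics.
       std::vector<compute::Expression> statistics_expressions_;
       // Marks fields whose statistics are already processed (lazy evaluation).
       std::vector<bool> statistics_expressions_complete_;
       // Guards cached metadata (same pattern as Parquet).
       std::mutex physical_schema_mutex_;
   };
   ```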
   
   **Key methods:**
   
   - `DeriveFieldGuarantee(stripe_stats, field)` — Convert ORC stripe 
statistics to an Arrow guarantee expression (`(field >= min AND field <= max) 
[OR is_null(field)]`). Uses the adapter's `GetStripeColumnStatistics()` API.
   
   - `TestStripes(predicate)` — For each stripe, build the guarantee expression 
from statistics, then call `SimplifyWithGuarantee(predicate, guarantee)`. 
Returns per-stripe simplified expressions.
   
   - `FilterStripes(predicate)` — Wrapper over `TestStripes()` that returns the 
list of stripe indices where `IsSatisfiable()` is true.
   
   - `Subset(stripe_ids)` — Create a new fragment representing only the 
specified stripes (addresses #49288).
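   
   The filtering idea behind `TestStripes()` / `FilterStripes()` can be sketched without Arrow. The snippet below is a simplified stand-in, not the proposed implementation: `StripeStats`, `GreaterThan`, and `MayMatch` are hypothetical names, and a plain min/max comparison replaces `SimplifyWithGuarantee()` + `IsSatisfiable()`.
   
   ```cpp
   #include <cstdint>
   #include <optional>
   #include <vector>

   // Hypothetical stand-in for the per-stripe statistics of one column.
   struct StripeStats {
     std::optional<int64_t> min;  // std::nullopt: statistics missing
     std::optional<int64_t> max;
   };

   // Hypothetical predicate of the form `field > threshold`.
   struct GreaterThan {
     int64_t threshold;
   };

   // Returns false only when the statistics prove that no row can match.
   bool MayMatch(const StripeStats& stats, const GreaterThan& pred) {
     if (!stats.min || !stats.max) return true;  // missing stats: keep the stripe
     return *stats.max > pred.threshold;
   }

   // Indices of the stripes that cannot be ruled out by their statistics.
   std::vector<int> FilterStripes(const std::vector<StripeStats>& stripes,
                                  const GreaterThan& pred) {
     std::vector<int> kept;
     for (int i = 0; i < static_cast<int>(stripes.size()); ++i) {
       if (MayMatch(stripes[i], pred)) kept.push_back(i);
     }
     return kept;
   }
   ```
   
   A stripe is dropped only when its statistics rule the predicate out. Anything uncertain, such as missing statistics, keeps the stripe in the scan.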
   
   **Scanner integration:**
   
   Wire `FilterStripes()` into `OrcFileFormat::ScanBatchesAsync()` so the scan 
reads only stripes that pass the predicate. Uses `ORCFileReader::Seek()` + 
`NextStripeReader()` for efficient streaming (rather than `ReadStripes()`, 
which materializes to a Table).
   
   **Infrastructure:**
   
   - Lazy evaluation: only process statistics for fields referenced in the 
predicate
   - Statistics caching: avoid re-parsing stripe statistics on repeated access
   - Thread safety: `physical_schema_mutex_` guards cached metadata (same 
pattern as Parquet)
   - Feature flag: `ARROW_ORC_DISABLE_PREDICATE_PUSHDOWN=1` to bypass filtering
   - ORC column index mapping: Arrow field index → ORC depth-first pre-order 
column ID (col 0 = root struct)
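   
   The column index mapping above can be illustrated with a minimal sketch. It assumes a hypothetical `TypeNode` tree rather than the real Arrow or liborc type classes, and only handles top-level fields: each field's ORC column ID is 1 (skipping the root struct) plus the number of columns occupied by all preceding fields and their descendants.
   
   ```cpp
   #include <cstddef>
   #include <vector>

   // Hypothetical minimal type tree: a node with zero or more children.
   struct TypeNode {
     std::vector<TypeNode> children;
   };

   // Number of ORC column IDs a subtree occupies (the node plus all descendants).
   int SubtreeSize(const TypeNode& node) {
     int size = 1;
     for (const auto& child : node.children) size += SubtreeSize(child);
     return size;
   }

   // ORC column ID of top-level field `field_index`; the root struct is column 0.
   int TopLevelFieldToOrcColumn(const TypeNode& root, std::size_t field_index) {
     int column_id = 1;  // column 0 is the root struct itself
     for (std::size_t i = 0; i < field_index; ++i) {
       column_id += SubtreeSize(root.children[i]);
     }
     return column_id;
   }
   ```
   
   For a schema `a: int, b: struct<c: int, d: int>, e: int`, the pre-order numbering is root = 0, `a` = 1, `b` = 2, `c` = 3, `d` = 4, `e` = 5, so Arrow field index 2 maps to ORC column 5.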
   
   **Initial type support:** INT32, INT64
   - INT32 overflow protection: if liborc int64 stats exceed INT32 bounds, 
conservatively include the stripe
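   
   One possible shape for the INT32 overflow check is sketched below. `NarrowToInt32` is a hypothetical helper, not part of Arrow or liborc; it assumes the statistics arrive as `int64_t` min/max values and signals "no usable bounds" with `std::nullopt`, which the caller would treat as "include the stripe".
   
   ```cpp
   #include <cstdint>
   #include <limits>
   #include <optional>
   #include <utility>

   // Narrows int64 min/max statistics to int32 bounds.
   // Returns std::nullopt if either bound lies outside the int32 range, in which
   // case the caller should derive no guarantee and keep the stripe.
   std::optional<std::pair<int32_t, int32_t>> NarrowToInt32(int64_t min, int64_t max) {
     constexpr int64_t kLo = std::numeric_limits<int32_t>::min();
     constexpr int64_t kHi = std::numeric_limits<int32_t>::max();
     if (min < kLo || min > kHi || max < kLo || max > kHi) return std::nullopt;
     return std::make_pair(static_cast<int32_t>(min), static_cast<int32_t>(max));
   }
   ```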
   
   ### Tests
   
   Dataset-layer tests in `file_orc_test.cc`:
   - Stripe filtering with selective predicates (verify correct stripes 
skipped/kept)
   - Greater-than predicate on multi-stripe file
   - Feature flag disables pushdown
   - Missing statistics → conservative include
   - INT32 overflow → conservative include
   - Multi-stripe file with known per-stripe value ranges
   - Explicit stripe selection via `Subset()`
   
   ### Component(s)
   
   C++

