cbb330 commented on issue #48986:
URL: https://github.com/apache/arrow/issues/48986#issuecomment-3940277126

   ## Implementation Plan
   
   Following @raulcd's and @wgtmac's suggestions, here is the detailed plan for 
ORC predicate pushdown. I will post a consolidated PoC PR (draft) for design 
review, then work through sub-issues per this plan.
   
   ### Architecture
   
   The implementation follows Parquet's proven pattern, adapted for ORC 
semantics:
   
   ```
   Query: dataset.to_table(filter=ds.field("id") > 1000)
   
   1. Adapter layer    → Extract stripe-level min/max statistics from liborc
   2. Expression layer → Convert ORC stats to Arrow guarantee expressions
   3. Filter layer     → SimplifyWithGuarantee() to test each stripe
   4. Scan layer       → Read only stripes that may satisfy the predicate
   5. Post-filter      → Acero row-level filtering for exact results
   ```
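The pruning decision in steps 2-4 can be sketched in a few lines. This is an illustrative model only, not the real Arrow API: `StripeStats` and `may_match` are hypothetical names, and the real implementation goes through `SimplifyWithGuarantee()` on Arrow expressions rather than hand-rolled comparisons. It does show the conservative-inclusion rule described below.

```python
# Hypothetical sketch of the stripe-pruning decision; StripeStats and
# may_match are illustrative names, not the actual Arrow/ORC API.
from dataclasses import dataclass
from typing import Optional

@dataclass
class StripeStats:
    minimum: Optional[int]  # None models missing/corrupted liborc statistics
    maximum: Optional[int]

def may_match(stats: StripeStats, op: str, literal: int) -> bool:
    """Return True if the stripe *may* contain matching rows.

    Conservative: any doubt (missing stats, unknown operator) keeps the
    stripe, so pruning can never produce false negatives.
    """
    if stats.minimum is None or stats.maximum is None:
        return True  # no usable statistics: include the stripe
    if op == ">":
        return stats.maximum > literal
    if op == "<":
        return stats.minimum < literal
    if op == "==":
        return stats.minimum <= literal <= stats.maximum
    return True  # unsupported operator: include the stripe

# For the filter `id > 1000`: a stripe whose max is 900 can be skipped,
# while one whose max is 5000 must still be read (then row-filtered).
assert not may_match(StripeStats(0, 900), ">", 1000)
assert may_match(StripeStats(0, 5000), ">", 1000)
```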
   
   ### Key design decisions
   
   **Extend the ORC adapter with a public statistics API** rather than 
accessing liborc types directly from `file_orc.cc`. This keeps a clean 
abstraction boundary — the adapter translates liborc types to Arrow types, and 
the dataset layer works purely with Arrow types. It also makes statistics 
access independently testable and reusable outside the Dataset API.
   
   **Create `OrcFileFragment`** (inheriting `FileFragment`) rather than using 
the generic `FileFragment`. This mirrors `ParquetFileFragment` and is needed to 
store stripe selection state, statistics cache, and metadata. Note: this 
overlaps with #49288 — I've commented there to coordinate.
   
   **Conservative filtering** — when statistics are missing, corrupted, or of an 
unsupported type, include the stripe rather than skip it. Pruning must never 
produce false negatives; at worst it reads extra data that the row-level 
post-filter discards.
   
   **Feature flag** — `ARROW_ORC_DISABLE_PREDICATE_PUSHDOWN=1` environment 
variable for instant rollback in production.
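How the kill switch might be consulted, as a minimal sketch — the helper name is hypothetical; only the environment variable name comes from the plan above:

```python
import os

def pushdown_enabled() -> bool:
    # Hypothetical helper: pushdown stays on unless the kill switch is set.
    return os.environ.get("ARROW_ORC_DISABLE_PREDICATE_PUSHDOWN") != "1"
```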
   
   ### Sub-issues
   
   I plan to file the following sub-issues, one per mergeable unit of work:
   
   1. **#GH_SUB1** — Adapter: stripe statistics API on `ORCFileReader`
   2. **#GH_SUB2** — Dataset: `OrcFileFragment` + stripe filtering 
infrastructure
   3. **#GH_SUB3** — Dataset: full operator and type coverage
   4. **#GH_SUB4** — Python: `orc.read_table(filters=...)` API
   
   (I'll update these references once the sub-issues are created.)
   
   ### Differences from Parquet
   
   | Aspect | Parquet | ORC | Impact |
   |--------|---------|-----|--------|
   | Unit of filtering | Row Group | Stripe | Naming only |
   | Column indexing | Schema-ordered, leaf-only | Depth-first pre-order, col 0 = root struct | Must map Arrow field index → ORC column ID |
   | Null detection | `null_count == num_values` | `hasNull() && getNumberOfValues() == 0` | Different check for all-null |
   | Statistics API | `parquet::RowGroupMetaData` | liborc `ColumnStatistics` subclasses | Different API, same information |
   | Schema manifest | `parquet::arrow::SchemaManifest` | ORC type tree via `Reader::getType()` | Must build field→column mapping |
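The column-indexing difference is the subtlest row in the table: ORC assigns column IDs in depth-first pre-order over the type tree, with ID 0 reserved for the root struct, so Arrow's flat field index never lines up directly. A pure-Python sketch of building the field-path → column ID mapping (the type-tree encoding and function name are illustrative, not the liborc API):

```python
# Illustrative model of ORC column numbering: IDs are assigned depth-first
# in pre-order, and ID 0 belongs to the root struct itself.
def assign_orc_column_ids(type_tree):
    """type_tree: ("struct", [(name, subtree), ...]) or (primitive_kind, None).
    Returns {dotted_field_path: orc_column_id}."""
    ids = {}
    counter = 0

    def walk(node, path):
        nonlocal counter
        my_id = counter
        counter += 1
        if path:  # skip the anonymous root struct
            ids[path] = my_id
        kind, children = node
        if kind == "struct":
            for name, child in children:
                walk(child, f"{path}.{name}" if path else name)

    walk(type_tree, "")
    return ids

# struct<id:int, name:string, address:struct<city:string>>
schema = ("struct", [
    ("id", ("int", None)),
    ("name", ("string", None)),
    ("address", ("struct", [("city", ("string", None))])),
])
# Root struct takes ID 0, then pre-order: id=1, name=2, address=3, city=4.
assert assign_orc_column_ids(schema) == {
    "id": 1, "name": 2, "address": 3, "address.city": 4,
}
```

Note how a nested struct shifts the IDs of everything after it, which is exactly why a precomputed mapping (analogous to Parquet's `SchemaManifest`) is needed rather than using the Arrow field index directly.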
   
   ### Type support roadmap
   
   | Phase | Types | Notes |
   |-------|-------|-------|
   | Initial | INT32, INT64 | Core integer types; overflow protection for INT32 |
   | Follow-up | DOUBLE, FLOAT | NaN handling required |
   | Follow-up | STRING | Potential truncation in ORC stats |
   | Follow-up | DATE, TIMESTAMP | Unit conversion (days/millis/nanos) |
   | Future | DECIMAL | Scale/precision matching |
   | Future | Bloom filters | Equality predicates |
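On the INT32 overflow note: liborc reports integer min/max as 64-bit values, so the comparison against the filter literal has to happen in 64-bit range. A hedged sketch of the hazard (function names are hypothetical) — for `x > 2**35` on an INT32 column no row can ever match, but a careless narrowing of the literal to 32 bits wraps it to 0 and keeps stripes that should be skipped:

```python
INT32_MAX = 2**31 - 1
INT32_MIN = -2**31

def int32_stripe_may_match_gt(stat_max: int, literal: int) -> bool:
    # Hypothetical overflow-safe check for `column > literal` on an INT32
    # column. liborc stats are int64, so compare in full 64-bit range first.
    if literal >= INT32_MAX:
        return False  # no INT32 value can exceed the literal: skip stripe
    return stat_max > literal

def wrapped_to_int32(v: int) -> int:
    # What a careless cast of the literal to int32 would do
    # (two's-complement wraparound).
    return (v + 2**31) % 2**32 - 2**31

# 2**35 wraps to 0, so a naive check would keep every stripe with max > 0.
assert wrapped_to_int32(2**35) == 0
assert not int32_stripe_may_match_gt(100, 2**35)  # correct: skip the stripe
assert 100 > wrapped_to_int32(2**35)              # naive wrap would keep it
```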
   
   ### Test plan
   
   - Stripe filtering with selective predicates (skip/keep verification)
   - All comparison and logical operators
   - NULL handling (IS NULL, IS NOT NULL, all-null stripes)
   - INT32 overflow protection
   - Missing/corrupted statistics → conservative include
   - Feature flag disable
   - Multi-stripe files with known per-stripe statistics
   - Python integration: Expression format, DNF tuple format, column projection
   
   I'll post the consolidated PoC PR shortly for design review. Looking forward 
to feedback on the approach, particularly around the adapter API surface and 
the sub-issue granularity.
   
   cc @wgtmac @raulcd

