cbb330 commented on issue #48986:
URL: https://github.com/apache/arrow/issues/48986#issuecomment-3940277126
## Implementation Plan
Following @raulcd's and @wgtmac's suggestions, here is the detailed plan for
ORC predicate pushdown. I will post a consolidated PoC PR (draft) for design
review, then work through sub-issues per this plan.
### Architecture
The implementation follows Parquet's proven pattern, adapted for ORC
semantics:
```
Query: dataset.to_table(filter=ds.field("id") > 1000)
1. Adapter layer → Extract stripe-level min/max statistics from liborc
2. Expression layer → Convert ORC stats to Arrow guarantee expressions
3. Filter layer → SimplifyWithGuarantee() to test each stripe
4. Scan layer → Read only stripes that may satisfy the predicate
5. Post-filter → Acero row-level filtering for exact results
```
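Steps 2–4 above amount to an interval test per stripe. Here is a minimal Python sketch of the pruning logic for a predicate like `ds.field("id") > 1000` — all names and the stats shape are illustrative, not the actual Arrow API:

```python
def prune_stripes(stripes, column, threshold):
    """Return indices of stripes that may contain rows with column > threshold."""
    selected = []
    for i, stripe_stats in enumerate(stripes):
        col_stats = stripe_stats.get(column)
        if col_stats is None:
            selected.append(i)          # no statistics: include conservatively
        elif col_stats["max"] > threshold:
            selected.append(i)          # stripe interval overlaps the predicate
    return selected

stripes = [
    {"id": {"min": 0, "max": 999}},     # max <= 1000: safe to skip
    {"id": {"min": 500, "max": 1500}},  # may contain matches: must read
]
print(prune_stripes(stripes, "id", 1000))  # [1]
```

The real implementation would express this through `SimplifyWithGuarantee()` on expressions rather than hand-rolled comparisons; the sketch only shows the stripe-level decision being made.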
### Key design decisions
**Extend the ORC adapter with a public statistics API** rather than
accessing liborc types directly from `file_orc.cc`. This keeps a clean
abstraction boundary — the adapter translates liborc types to Arrow types, and
the dataset layer works purely with Arrow types. It also makes statistics
access independently testable and reusable outside the Dataset API.
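As a rough Python mirror of what such a translated statistics record might carry — the real API would be C++ on `ORCFileReader` returning Arrow types, and every name below is an assumption for illustration, not the proposed design:

```python
from dataclasses import dataclass
from typing import Any, Optional

@dataclass(frozen=True)
class StripeColumnStatistics:
    """Illustrative translated form of a liborc ColumnStatistics record."""
    minimum: Optional[Any]   # None when liborc provides no min/max
    maximum: Optional[Any]
    has_null: bool
    num_values: int          # counts non-null values, per ORC semantics

    def is_all_null(self) -> bool:
        # ORC counts only non-null values, so all-null means zero values
        return self.has_null and self.num_values == 0

stats = StripeColumnStatistics(minimum=1, maximum=999,
                               has_null=False, num_values=10_000)
print(stats.is_all_null())  # False
```

Keeping the record in Arrow terms means the dataset layer never needs to know which liborc `ColumnStatistics` subclass produced it.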
**Create `OrcFileFragment`** (inheriting `FileFragment`) rather than using
the generic `FileFragment`. This mirrors `ParquetFileFragment` and is needed to
store stripe selection state, statistics cache, and metadata. Note: this
overlaps with #49288 — I've commented there to coordinate.
**Conservative filtering** — when statistics are missing, corrupted, or for
unsupported types, include the stripe (never produce false negatives).
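A sketch of that conservative wrapper, assuming a per-stripe evaluation callable (the names are hypothetical):

```python
def stripe_passes(evaluate, stats):
    """Apply `evaluate` to stripe stats; on any doubt, keep the stripe."""
    if stats is None:
        return True                      # missing statistics: include
    try:
        return evaluate(stats)
    except (KeyError, TypeError, OverflowError, ValueError):
        return True                      # corrupted/unsupported: include

predicate = lambda s: s["max"] > 1000
print(stripe_passes(predicate, {"max": 500}))  # False: provably no matches
print(stripe_passes(predicate, None))          # True: no stats, must read
print(stripe_passes(predicate, {"min": 0}))    # True: malformed stats, must read
```

The invariant is one-directional: a stripe may be read unnecessarily (false positive, fixed by the post-filter), but a matching stripe is never skipped.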
**Feature flag** — `ARROW_ORC_DISABLE_PREDICATE_PUSHDOWN=1` environment
variable for instant rollback in production.
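The check itself is trivial; a Python sketch (the variable name is from the plan above, the helper name is mine):

```python
import os

def pushdown_enabled() -> bool:
    """Predicate pushdown stays on unless the kill switch is set to 1."""
    return os.environ.get("ARROW_ORC_DISABLE_PREDICATE_PUSHDOWN") != "1"

os.environ["ARROW_ORC_DISABLE_PREDICATE_PUSHDOWN"] = "1"
print(pushdown_enabled())  # False: all stripes are read, as before pushdown
del os.environ["ARROW_ORC_DISABLE_PREDICATE_PUSHDOWN"]
print(pushdown_enabled())  # True
```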
### Sub-issues
I've created the following sub-issues to track each mergeable unit of work:
1. **#GH_SUB1** — Adapter: stripe statistics API on `ORCFileReader`
2. **#GH_SUB2** — Dataset: `OrcFileFragment` + stripe filtering infrastructure
3. **#GH_SUB3** — Dataset: full operator and type coverage
4. **#GH_SUB4** — Python: `orc.read_table(filters=...)` API
(I'll update these references once the sub-issues are created.)
### Differences from Parquet
| Aspect | Parquet | ORC | Impact |
|--------|---------|-----|--------|
| Unit of filtering | Row Group | Stripe | Naming only |
| Column indexing | Schema-ordered, leaf-only | Depth-first pre-order, col 0 = root struct | Must map Arrow field index → ORC column ID |
| Null detection | `null_count == num_values` | `hasNull() && getNumberOfValues() == 0` | Different check for all-null |
| Statistics API | `parquet::RowGroupMetaData` | liborc `ColumnStatistics` subclasses | Different API, same information |
| Schema manifest | `parquet::arrow::SchemaManifest` | ORC type tree via `Reader::getType()` | Must build field→column mapping |
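The column-indexing row is the subtle one: ORC assigns IDs in a depth-first pre-order walk of the type tree, with column 0 reserved for the root struct. A toy Python walk over a nested schema illustrates the numbering (a list of `(name, children)` pairs stands in for liborc's `Type` tree):

```python
def assign_ids(fields, start=1, prefix=""):
    """Return ({dotted_path: orc_column_id}, next_free_id), pre-order."""
    mapping, next_id = {}, start
    for name, children in fields:
        path = prefix + name
        mapping[path] = next_id      # parent gets its ID before children
        next_id += 1
        child_map, next_id = assign_ids(children, next_id, path + ".")
        mapping.update(child_map)
    return mapping, next_id

# struct<a:int, b:struct<c:int, d:int>, e:int>  (root struct = column 0)
schema = [("a", []), ("b", [("c", []), ("d", [])]), ("e", [])]
print(assign_ids(schema)[0])
# {'a': 1, 'b': 2, 'b.c': 3, 'b.d': 4, 'e': 5}
```

Note how `e` lands at 5, not 3: nested children consume IDs before later siblings, which is exactly why a flat Arrow field index cannot be used as an ORC column ID.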
### Type support roadmap
| Phase | Types | Notes |
|-------|-------|-------|
| Initial | INT32, INT64 | Core integer types, overflow protection for INT32 |
| Follow-up | DOUBLE, FLOAT | NaN handling required |
| Follow-up | STRING | Potential truncation in ORC stats |
| Follow-up | DATE, TIMESTAMP | Unit conversion (days/millis/nanos) |
| Future | DECIMAL | Scale/precision matching |
| Future | Bloom filters | Equality predicates |
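On the INT32 overflow protection noted in the initial phase: liborc reports integer min/max as 64-bit values, so a predicate literal outside the int32 range must be resolved before any narrowing comparison. A hedged sketch for a `col > literal` predicate (function name is illustrative):

```python
INT32_MIN, INT32_MAX = -2**31, 2**31 - 1

def int32_gt_may_match(stripe_max, literal):
    """Can any int32 row in a stripe with this max satisfy col > literal?

    Comparisons stay in 64-bit Python ints; the literal is never cast
    down to int32, which is where the overflow bug would creep in.
    """
    if literal >= INT32_MAX:
        return False   # no int32 value can strictly exceed the literal
    if literal < INT32_MIN:
        return True    # every int32 value satisfies the predicate
    return stripe_max > literal

print(int32_gt_may_match(100, 2**40))   # False: prune, literal above int32 range
print(int32_gt_may_match(100, -2**40))  # True: keep, literal below int32 range
```

Naively truncating `2**40` to int32 would wrap to a small value and wrongly keep (or skip) stripes; handling the out-of-range cases first avoids that.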
### Test plan
- Stripe filtering with selective predicates (skip/keep verification)
- All comparison and logical operators
- NULL handling (IS NULL, IS NOT NULL, all-null stripes)
- INT32 overflow protection
- Missing/corrupted statistics → conservative include
- Feature flag disable
- Multi-stripe files with known per-stripe statistics
- Python integration: Expression format, DNF tuple format, column projection
I'll post the PoC PR shortly. Looking forward to feedback on the approach,
particularly around the adapter API surface and the sub-issue granularity.
cc @wgtmac @raulcd