cbb330 opened a new issue, #48986: URL: https://github.com/apache/arrow/issues/48986
### Describe the enhancement requested

Arrow's ORC reader already supports **column projection** (reading only selected columns), but lacks **row-level predicate pushdown**. Currently, filtering rows from ORC files requires:

1. Reading all rows from selected columns (all stripes)
2. Applying filters post-read using Arrow compute

This is inefficient for large ORC files where only a small subset of rows match the filter criteria. ORC files store min/max statistics at the stripe level, which can be used to skip entire stripes that cannot contain matching rows, avoiding I/O for data that will be filtered out anyway.

### Use Cases

1. **Data Lake Queries**: Efficiently query large ORC datasets with selective predicates
2. **PyIceberg Integration**: Enable predicate pushdown for Iceberg tables stored in ORC format
3. **Parity with Parquet**: Match the filtering capabilities already available for Parquet files

### Component(s)

Python
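
### Illustrative sketch

To make the stripe-skipping idea concrete, here is a minimal sketch in plain Python. It is not Arrow's API and not a proposed implementation: `StripeStats` and `stripes_to_read` are hypothetical names, and the statistics are hard-coded integers rather than values read from an ORC file. It only shows how per-stripe min/max values could rule out stripes for a simple comparison predicate.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class StripeStats:
    """Hypothetical min/max summary of one column within one stripe."""
    index: int
    min_value: int
    max_value: int


def stripes_to_read(stats: Sequence[StripeStats], op: str, literal: int) -> List[int]:
    """Return indices of stripes that may contain rows matching `column <op> literal`.

    A stripe is dropped only when its min/max range proves no row can match.
    """
    keep = []
    for s in stats:
        if op == "==":
            may_match = s.min_value <= literal <= s.max_value
        elif op == ">":
            may_match = s.max_value > literal
        elif op == ">=":
            may_match = s.max_value >= literal
        elif op == "<":
            may_match = s.min_value < literal
        elif op == "<=":
            may_match = s.min_value <= literal
        else:
            # Unsupported operator: stay conservative and read the stripe.
            may_match = True
        if may_match:
            keep.append(s.index)
    return keep


# Assumed example statistics for three stripes of an integer column.
stats = [
    StripeStats(index=0, min_value=1, max_value=100),
    StripeStats(index=1, min_value=101, max_value=200),
    StripeStats(index=2, min_value=201, max_value=300),
]

# Only the last stripe can contain values greater than 250.
selected = stripes_to_read(stats, ">", 250)
print(selected)
```

Min/max statistics can only prove that a stripe contains no matching rows, so the stripes that survive this check would still need the row-level filter applied after they are read.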
