[I] ORC Predicate Pushdown [arrow]

via GitHub Mon, 26 Jan 2026 01:33:11 -0800


cbb330 opened a new issue, #48986:
URL: https://github.com/apache/arrow/issues/48986


   ### Describe the enhancement requested
   
   Arrow's ORC reader already supports **column projection** (reading only 
selected columns), but lacks **row-level predicate pushdown**. Currently, 
filtering rows from ORC files requires:
   1. Reading all rows from selected columns (all stripes)
   2. Applying filters post-read using Arrow compute
   
   This is inefficient for large ORC files where only a small subset of rows 
matches the filter criteria. ORC files store min/max statistics at the stripe 
level, which can be used to skip entire stripes that cannot contain matching 
rows, avoiding I/O for data that would be filtered out anyway.
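
   The stripe-skipping idea can be illustrated with a minimal, library-free 
sketch. It is not Arrow's or ORC's actual API: `StripeStats` and 
`stripes_to_read` are hypothetical names, and the min/max values are made up 
for illustration. It only shows the range-overlap test that statistics-based 
pruning relies on.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class StripeStats:
    """Hypothetical per-stripe min/max statistics for a single column."""
    index: int
    min_value: int
    max_value: int


def stripes_to_read(stats: List[StripeStats],
                    lower: Optional[int] = None,
                    upper: Optional[int] = None) -> List[int]:
    """Return indices of stripes whose [min, max] range may overlap
    the inclusive predicate range [lower, upper]."""
    selected = []
    for s in stats:
        # Every value in the stripe is below the lower bound: skip it.
        if lower is not None and s.max_value < lower:
            continue
        # Every value in the stripe is above the upper bound: skip it.
        if upper is not None and s.min_value > upper:
            continue
        selected.append(s.index)
    return selected


# Made-up statistics for three stripes (assumption, for illustration only).
stats = [
    StripeStats(index=0, min_value=1, max_value=100),
    StripeStats(index=1, min_value=101, max_value=200),
    StripeStats(index=2, min_value=201, max_value=300),
]

# Predicate: 150 <= value <= 180. Only stripe 1 can contain matches.
selected = stripes_to_read(stats, lower=150, upper=180)
print(selected)
```

   A reader could then load only the surviving stripes (for example with 
`ORCFile.read_stripe` in pyarrow) and still apply the row-level filter 
afterwards, since the statistics only rule stripes out and never prove that 
every row in a stripe matches.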
   
   ### Use Cases
   
   1. **Data Lake Queries**: Efficiently query large ORC datasets with 
selective predicates
   2. **PyIceberg Integration**: Enable predicate pushdown for Iceberg tables 
stored in ORC format
   3. **Parity with Parquet**: Match the filtering capabilities already 
available for Parquet files
   
   ### Component(s)
   
   Python


