cbb330 opened a new pull request, #49181:
URL: https://github.com/apache/arrow/pull/49181

   ## Summary
   
   Part 15/15 of ORC predicate pushdown implementation.
   
   ⚠️ **Depends on PRs 1-14 being merged first**
   
   Adds a Python API for ORC predicate pushdown by exposing a `filters` parameter on `orc.read_table()`. This provides API parity with Parquet's `read_table()` function.
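
   For illustration, the same `filters` value works with both readers once this PR lands (paths and values here are placeholders):
   ```python
   import pyarrow.orc as orc
   import pyarrow.parquet as pq

   # Identical DNF filter shape for both file formats
   pq_table = pq.read_table('data.parquet', filters=[('id', '>', 1000)])
   orc_table = orc.read_table('data.orc', filters=[('id', '>', 1000)])  # added by this PR
   ```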
   
   **This is the final PR in the stacked series.**
   * GitHub Issue: #48986
   
   ## Changes
   
   - Add `filters` parameter to `orc.read_table()`, supporting both Expression and DNF tuple formats
   - Delegate to the Dataset API when `filters` is specified
   - Add comprehensive documentation with examples in the module docstring
   - Add 5 test functions covering smoke tests, integration, and correctness
   
   ## Implementation
   
   The implementation is pure Python with no Cython changes. It reuses existing 
Dataset API bindings and the `filters_to_expression()` utility from Parquet for 
DNF tuple conversion.
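
   For reference, a minimal sketch of what that conversion produces (column name and values are illustrative):
   ```python
   import pyarrow.parquet as pq

   # A flat list of tuples is AND-ed; a list of lists is OR-ed
   expr = pq.filters_to_expression([('id', '>', 100), ('id', '<', 200)])
   # expr is equivalent to (ds.field('id') > 100) & (ds.field('id') < 200)
   ```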
   
   When `filters` is specified, the function delegates to:
   ```python
   # `filter_expr` is the `filters` argument, converted from DNF tuples
   # via filters_to_expression() when necessary
   dataset = ds.dataset(source, format='orc', filesystem=filesystem)
   return dataset.to_table(columns=columns, filter=filter_expr)
   ```
   
   This leverages the C++ predicate pushdown infrastructure added in PRs 1-5.
   
   ## Test Coverage
   
   - Expression format: `ds.field('id') > 100`
   - DNF tuple format: `[('id', '>', 100)]`
   - Integration with column projection
   - Correctness validation against post-filtering (see the sketch after this list)
   - Edge case: `filters=None`
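
   A hypothetical sketch of the correctness check (test name and data are illustrative, not the exact test in the diff):
   ```python
   import pyarrow as pa
   import pyarrow.compute as pc
   import pyarrow.orc as orc

   def test_filters_match_post_filtering(tmp_path):
       # Pushed-down filtering must match reading everything
       # and filtering the table in memory afterwards.
       path = str(tmp_path / "data.orc")
       table = pa.table({'id': pa.array(range(1000), type=pa.int64())})
       orc.write_table(table, path)

       pushed = orc.read_table(path, filters=[('id', '>', 100)])
       expected = table.filter(pc.field('id') > 100)
       assert pushed.equals(expected)
   ```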
   
   ## Examples
   
   **Expression format:**
   ```python
   import pyarrow.orc as orc
   import pyarrow.dataset as ds
   
   table = orc.read_table('data.orc', filters=ds.field('id') > 1000)
   ```
   
   **DNF tuple format (Parquet-compatible):**
   ```python
   # Single condition
   table = orc.read_table('data.orc', filters=[('id', '>', 1000)])
   
   # Multiple conditions (AND)
   table = orc.read_table('data.orc', filters=[('id', '>', 100), ('id', '<', 200)])

   # OR conditions
   table = orc.read_table('data.orc', filters=[[('x', '==', 1)], [('x', '==', 2)]])
   ```
   
   **With column projection:**
   ```python
   table = orc.read_table('data.orc',
                          columns=['id', 'value'],
                          filters=[('id', '>', 1000)])
   ```
   
   ## Supported Operators
   
   `==`, `!=`, `<`, `>`, `<=`, `>=`, `in`, `not in`
   
   Pushdown is currently optimized for INT32 and INT64 columns.
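
   For the set-membership operators the value is an iterable, mirroring Parquet's DNF semantics (illustrative):
   ```python
   # Set membership
   table = orc.read_table('data.orc', filters=[('id', 'in', [1, 2, 3])])

   # Negated membership
   table = orc.read_table('data.orc', filters=[('id', 'not in', [1, 2, 3])])
   ```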
   
   ## Rationale
   
   This API makes ORC predicate pushdown accessible to Python users without requiring them to use the lower-level Dataset API directly. It mirrors Parquet's `read_table(filters=...)` API for consistency.

   This PR replaces the placeholder commit from the original plan with a full working implementation.
   

