[I] [Python][Dataset] Add filters parameter to orc.read_table() for predicate pushdown [arrow]

via GitHub Sat, 21 Feb 2026 22:30:05 -0800


cbb330 opened a new issue, #49363:
URL: https://github.com/apache/arrow/issues/49363


   ### Summary
   
   Part 4 of ORC predicate pushdown (#48986). Depends on #49361.
   
   Add a `filters` parameter to `pyarrow.orc.read_table()` for API parity with
   Parquet's `read_table()`. This makes ORC predicate pushdown accessible to
   Python users without requiring the lower-level Dataset API.
   
   ### Changes
   
   **`python/pyarrow/orc.py`:**
   
   Add a `filters` parameter to `read_table()`. When it is specified, delegate
   to the Dataset API:
   
   ```python
   def read_table(source, columns=None, filesystem=None, filters=None):
       if filters is not None:
           import pyarrow.dataset as ds
           filter_expr = filters
           if not isinstance(filters, ds.Expression):
               filter_expr = ds.filters_to_expression(filters)
           dataset = ds.dataset(source, format='orc', filesystem=filesystem)
           return dataset.to_table(columns=columns, filter=filter_expr)
       # ... existing non-filter path unchanged
   ```
   
   **Supported filter formats:**
   
   - Expression format: `ds.field('id') > 100`
   - DNF tuple format: `[('id', '>', 100)]` (Parquet-compatible)
   - Supported operators: `==`, `!=`, `<`, `>`, `<=`, `>=`, `in`, `not in`
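
To make the DNF tuple format concrete, here is a minimal pure-Python sketch of how such filters are conventionally interpreted in the Parquet-compatible form: a flat list of tuples is a single AND-group, and a list of lists is an OR over AND-groups. This is an illustration only, not pyarrow's implementation; the helper names `normalize_dnf` and `row_matches` are hypothetical, and rows are modelled as plain dicts.

```python
import operator

# Operators listed in the issue, mapped to plain-Python predicates.
_OPS = {
    "==": operator.eq,
    "!=": operator.ne,
    "<": operator.lt,
    ">": operator.gt,
    "<=": operator.le,
    ">=": operator.ge,
    "in": lambda value, choices: value in choices,
    "not in": lambda value, choices: value not in choices,
}


def normalize_dnf(filters):
    """Return the filters as a list of AND-groups (a list of lists of tuples)."""
    if filters and all(isinstance(f, tuple) for f in filters):
        return [list(filters)]  # a flat list is one AND-group
    return [list(group) for group in filters]


def row_matches(row, filters):
    """True if the row satisfies every predicate of at least one AND-group."""
    return any(
        all(_OPS[op](row[col], val) for col, op, val in group)
        for group in normalize_dnf(filters)
    )


rows = [{"id": 50}, {"id": 150}, {"id": 250}]

# Flat list: id > 100 AND id < 200
between = [r["id"] for r in rows
           if row_matches(r, [("id", ">", 100), ("id", "<", 200)])]

# List of lists: id < 100 OR id > 200
outside = [r["id"] for r in rows
           if row_matches(r, [[("id", "<", 100)], [("id", ">", 200)]])]
```

In pyarrow itself the equivalent conversion produces an `Expression` rather than evaluating rows one by one, so the filter can be handed to the scanner.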
   
   **No Cython changes.** This is pure Python, reusing existing Dataset API 
bindings and the `filters_to_expression()` utility already used by Parquet.
   
   ### Examples
   
   ```python
   import pyarrow.orc as orc
   import pyarrow.dataset as ds
   
   # Expression format
   table = orc.read_table('data.orc', filters=ds.field('id') > 1000)
   
   # DNF tuple format
   table = orc.read_table('data.orc', filters=[('id', '>', 1000)])
   
   # Multiple conditions (AND)
   table = orc.read_table('data.orc',
                          filters=[('id', '>', 100), ('id', '<', 200)])
   
   # With column projection
   table = orc.read_table('data.orc', columns=['id', 'value'],
                          filters=[('id', '>', 1000)])
   ```
   
   ### Tests
   
   Tests in `python/pyarrow/tests/test_orc.py`:
   - Expression format smoke test
   - DNF tuple format smoke test
   - Integration with column projection
   - Correctness validation: filtered result matches post-filter of full read
   - `filters=None` preserves existing behavior
   
   ### Component(s)
   
   Python


