cbb330 opened a new issue, #49363:
URL: https://github.com/apache/arrow/issues/49363
### Summary
Part 4 of ORC predicate pushdown (#48986). Depends on #49361.
Add a `filters` parameter to `pyarrow.orc.read_table()` for API parity with
Parquet's `read_table()`. This makes ORC predicate pushdown accessible to
Python users without requiring the lower-level Dataset API.
### Changes
**`python/pyarrow/orc.py`:**
Add `filters` parameter to `read_table()`. When specified, delegate to the
Dataset API:
```python
def read_table(source, columns=None, filesystem=None, filters=None):
    if filters is not None:
        import pyarrow.dataset as ds
        filter_expr = filters
        if not isinstance(filters, ds.Expression):
            filter_expr = ds.filters_to_expression(filters)
        dataset = ds.dataset(source, format='orc', filesystem=filesystem)
        return dataset.to_table(columns=columns, filter=filter_expr)
    # ... existing non-filter path unchanged
```
**Supported filter formats:**
- Expression format: `ds.field('id') > 100`
- DNF tuple format: `[('id', '>', 100)]` (Parquet-compatible)
- Supported operators: `==`, `!=`, `<`, `>`, `<=`, `>=`, `in`, `not in`
**No Cython changes.** This is pure Python, reusing existing Dataset API
bindings and the `filters_to_expression()` utility already used by Parquet.
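
To make the DNF tuple semantics above concrete, here is a minimal pure-Python sketch of how such a filter is interpreted for a single row. It assumes the Parquet convention that a flat list of tuples is an AND of conditions and a list of lists is an OR of AND groups. The helper `row_matches` is hypothetical and only for illustration; it is not how `filters_to_expression()` is implemented, and it ignores null handling.

```python
import operator

# Operators from the list above, mapped to plain Python predicates.
_OPS = {
    "==": operator.eq,
    "!=": operator.ne,
    "<": operator.lt,
    ">": operator.gt,
    "<=": operator.le,
    ">=": operator.ge,
    "in": lambda value, choices: value in choices,
    "not in": lambda value, choices: value not in choices,
}


def row_matches(row, filters):
    """Return True if `row` (a dict) satisfies the DNF `filters`.

    Hypothetical helper for illustration only; nulls are not handled.
    """
    # A flat list of tuples is one AND group. Wrap it so that both shapes
    # are evaluated as an OR over AND groups.
    if filters and isinstance(filters[0], list):
        groups = filters
    else:
        groups = [filters]
    return any(
        all(_OPS[op](row[col], val) for col, op, val in group)
        for group in groups
    )
```

For example, `row_matches({'id': 150}, [('id', '>', 100), ('id', '<', 200)])` evaluates both conditions and requires both to hold.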
### Examples
```python
import pyarrow.orc as orc
import pyarrow.dataset as ds
# Expression format
table = orc.read_table('data.orc', filters=ds.field('id') > 1000)
# DNF tuple format
table = orc.read_table('data.orc', filters=[('id', '>', 1000)])
# Multiple conditions (AND)
table = orc.read_table('data.orc',
                       filters=[('id', '>', 100), ('id', '<', 200)])
# With column projection
table = orc.read_table('data.orc', columns=['id', 'value'],
                       filters=[('id', '>', 1000)])
```
### Tests
Tests in `python/pyarrow/tests/test_orc.py`:
- Expression format smoke test
- DNF tuple format smoke test
- Integration with column projection
- Correctness validation: filtered result matches post-filter of full read
- `filters=None` preserves existing behavior
### Component(s)
Python