ndrluis commented on code in PR #3448:
URL: https://github.com/apache/iceberg-python/pull/3448#discussion_r3359711150
##########
pyiceberg/io/pyarrow.py:
##########
@@ -1676,22 +1676,13 @@ def _task_to_record_batches(
# Create the mask of indices that we're interested in
indices = _combine_positional_deletes(positional_deletes,
current_index, current_index + len(batch))
current_batch = current_batch.take(indices)
+ if pyarrow_filter is not None:
+ current_batch = current_batch.filter(pyarrow_filter)
Review Comment:
This change replaces the Table-based filtering workaround.
PyArrow 17-20 raise `IndexError` when `RecordBatch.filter(Expression)`
produces zero rows. PyArrow 21 fixed this behavior, but we still declares
support for `pyarrow>=17.0.0`.
On PyArrow 17-20 this fails with `IndexError: list index out of range`
We should either keep that workaround for supported PyArrow versions or
raise the minimum supported version to `pyarrow>=21.0.0`.
A regression test should cover the positional-delete path where the
post-delete row filter produces an empty result, for example:
```python
def
test_task_to_record_batches_filter_after_positional_deletes_empty_result(tmpdir:
str) -> None:
from pyiceberg.expressions.visitors import bind
arrow_schema = pa.schema(
(
pa.field(
"id",
pa.int32(),
nullable=True,
metadata={PYARROW_PARQUET_FIELD_ID_KEY: "1"},
),
)
)
arrow_table = pa.table([pa.array([1, 2, 3], type=pa.int32())],
schema=arrow_schema)
data_file = _write_table_to_data_file(
f"{tmpdir}/test_filter_after_positional_deletes_empty_result.parquet",
arrow_schema,
arrow_table,
)
table_schema = Schema(NestedField(1, "id", IntegerType(),
required=False))
positional_deletes = [pa.chunked_array([pa.array([],
type=pa.int64())])]
result_batches = list(
_task_to_record_batches(
PyArrowFileIO(),
FileScanTask(data_file),
bound_row_filter=bind(table_schema, GreaterThan("id", 10),
case_sensitive=True),
projected_schema=table_schema,
table_schema=table_schema,
projected_field_ids={1},
positional_deletes=positional_deletes,
case_sensitive=True,
)
)
assert result_batches == []
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]