geruh commented on code in PR #3285:
URL: https://github.com/apache/iceberg-python/pull/3285#discussion_r3204180249
##########
pyiceberg/table/delete_file_index.py:
##########
@@ -76,6 +90,36 @@ def _applies_to_data_file(delete_file: DataFile, data_file:
DataFile) -> bool:
return evaluator.eval(delete_file)
+def _eq_applies_to_data_file(eq_delete_file: DataFile, data_file: DataFile,
schema: Schema) -> bool:
+ if not eq_delete_file.equality_ids:
Review Comment:
Looking at [java's matching function
](https://github.com/apache/iceberg/blob/main/core/src/main/java/org/apache/iceberg/DeleteFileIndex.java#L252-L258)
in the dfi.
They have two additional pruning branches for null-counts.
1. Data file is all null for an equality field and the delete has zero null
rows therefore, delete can't match, and we prune.
2. Delete file is all null for an equality field and the data has zero null
rows therefore, delete targets nulls that don't exist, and we prune.
Without these, we are grouping files that can be skipped for given data
files. Ultimately, we should have correct results even with the over inclusion
of delete files but are paying a price of I/O.
example scenario:
```python
# eq dekete targets only null rows
eq = _create_equality_delete(
sequence_number=10,
null_value_counts={1: 10},
value_counts={1: 10},
)
# Data file has zero nulls
data = _create_data_file(
null_value_counts={1: 0},
value_counts={1: 100},
)
index.add_delete_file(eq)
# Java would prune this to 0
assert len(index.for_data_file(1, data)) == 1
```
It's definitely worth either doing here or in a a follow up!
##########
tests/table/test_delete_file_index.py:
##########
@@ -187,3 +218,113 @@ def test_record_equality_for_partition_lookup() -> None:
assert len(index.for_data_file(1, data_file, partition_b)) == 1
assert len(index.for_data_file(1, data_file, partition_c)) == 0
+
+
+def test_equality_delete_sequence_number_filtering() -> None:
+ index = DeleteFileIndex()
+
+ # Equality delete with sequence number 2
+ index.add_delete_file(_create_equality_delete(sequence_number=2))
+
+ data_file = _create_data_file()
+
+ # Data file with sequence number 1 should be affected by equality delete
with sequence number 2
+ assert len(index.for_data_file(1, data_file)) == 1
+
+ # Data file with sequence number 2 should NOT be affected by equality
delete with sequence number 2
+ # Equality deletes apply only to data files added in strictly earlier
snapshots (seq - 1)
+ assert len(index.for_data_file(2, data_file)) == 0
+
+ # Data file with sequence number 3 should NOT be affected
+ assert len(index.for_data_file(3, data_file)) == 0
+
+
+def test_equality_delete_sequence_number_unpartitioned() -> None:
+ data_file = _create_data_file()
+
+ # Create both types of deletes at sequence number 10
+ pos_delete = _create_positional_delete(sequence_number=10)
+ eq_delete = _create_equality_delete(sequence_number=10)
+
+ index = DeleteFileIndex()
+ index.add_delete_file(pos_delete)
+ index.add_delete_file(eq_delete)
+
+ # Sequence 10
+ deletes = index.for_data_file(10, data_file)
+ assert len(deletes) == 1
+ # Position deletes will apply (applies to seq <= 10)
+ assert pos_delete.data_file in deletes
+ # Equality deletes will not (applies to seq < 10)
+ assert eq_delete.data_file not in deletes
+
+ # Sequence 9
+ # Both deletes will apply.
+ deletes = index.for_data_file(9, data_file)
+ assert len(deletes) == 2
+ assert pos_delete.data_file in deletes
+ assert eq_delete.data_file in deletes
+
+ # At sequence 111
Review Comment:
nit: seq 11
##########
pyiceberg/io/pyarrow.py:
##########
@@ -1693,7 +1693,12 @@ def _task_to_record_batches(
def _read_all_delete_files(io: FileIO, tasks: Iterable[FileScanTask]) ->
dict[str, list[ChunkedArray]]:
Review Comment:
I was trying to say this is redundant with below rendering this line
unreachable. But this is in th right direction for the follow up. I wanted to
say that we aren't ignoring equality deletes we are throwing if a scan contains
them
https://github.com/apache/iceberg-python/blob/3a7413a9de2346b4b50b004ab17d6ec4cce2da41/pyiceberg/table/__init__.py#L2059-L2075
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]