geruh commented on code in PR #3285:
URL: https://github.com/apache/iceberg-python/pull/3285#discussion_r3204180249


##########
pyiceberg/table/delete_file_index.py:
##########
@@ -76,6 +90,36 @@ def _applies_to_data_file(delete_file: DataFile, data_file: 
DataFile) -> bool:
     return evaluator.eval(delete_file)
 
 
+def _eq_applies_to_data_file(eq_delete_file: DataFile, data_file: DataFile, 
schema: Schema) -> bool:
+    if not eq_delete_file.equality_ids:

Review Comment:
   Looking at [java's matching function 
](https://github.com/apache/iceberg/blob/main/core/src/main/java/org/apache/iceberg/DeleteFileIndex.java#L252-L258)
 in the dfi. 
   
   They have two additional pruning branches for null-counts. 
   
   1.  Data file is all null for an equality field and the delete has zero null 
rows therefore, delete can't match, and we prune.
   2. Delete file is all null for an equality field and the data has zero null 
rows therefore, delete targets nulls that don't exist, and we prune. 
   
   Without these, we are grouping files that can be skipped for given data 
files. Ultimately, we should have correct results even with the over inclusion 
of delete files but are paying a price of I/O. 
   
   example scenario:
   
   ```python
   
   # eq dekete targets only null rows
   eq = _create_equality_delete(
      sequence_number=10,
      null_value_counts={1: 10},
      value_counts={1: 10},
   )
   
   # Data file has zero nulls
   data = _create_data_file(
     null_value_counts={1: 0},
     value_counts={1: 100},
   )
   
   index.add_delete_file(eq)
   
   # Java would prune this to 0
   assert len(index.for_data_file(1, data)) == 1
   ```
   
   It's definitely worth either doing here or in a a follow up! 



##########
tests/table/test_delete_file_index.py:
##########
@@ -187,3 +218,113 @@ def test_record_equality_for_partition_lookup() -> None:
 
     assert len(index.for_data_file(1, data_file, partition_b)) == 1
     assert len(index.for_data_file(1, data_file, partition_c)) == 0
+
+
+def test_equality_delete_sequence_number_filtering() -> None:
+    index = DeleteFileIndex()
+
+    # Equality delete with sequence number 2
+    index.add_delete_file(_create_equality_delete(sequence_number=2))
+
+    data_file = _create_data_file()
+
+    # Data file with sequence number 1 should be affected by equality delete 
with sequence number 2
+    assert len(index.for_data_file(1, data_file)) == 1
+
+    # Data file with sequence number 2 should NOT be affected by equality 
delete with sequence number 2
+    # Equality deletes apply only to data files added in strictly earlier 
snapshots (seq - 1)
+    assert len(index.for_data_file(2, data_file)) == 0
+
+    # Data file with sequence number 3 should NOT be affected
+    assert len(index.for_data_file(3, data_file)) == 0
+
+
+def test_equality_delete_sequence_number_unpartitioned() -> None:
+    data_file = _create_data_file()
+
+    # Create both types of deletes at sequence number 10
+    pos_delete = _create_positional_delete(sequence_number=10)
+    eq_delete = _create_equality_delete(sequence_number=10)
+
+    index = DeleteFileIndex()
+    index.add_delete_file(pos_delete)
+    index.add_delete_file(eq_delete)
+
+    # Sequence 10
+    deletes = index.for_data_file(10, data_file)
+    assert len(deletes) == 1
+    # Position deletes will apply (applies to seq <= 10)
+    assert pos_delete.data_file in deletes
+    # Equality deletes will not (applies to seq < 10)
+    assert eq_delete.data_file not in deletes
+
+    # Sequence 9
+    # Both deletes will apply.
+    deletes = index.for_data_file(9, data_file)
+    assert len(deletes) == 2
+    assert pos_delete.data_file in deletes
+    assert eq_delete.data_file in deletes
+
+    # At sequence 111

Review Comment:
   nit: seq 11



##########
pyiceberg/io/pyarrow.py:
##########
@@ -1693,7 +1693,12 @@ def _task_to_record_batches(
 
 def _read_all_delete_files(io: FileIO, tasks: Iterable[FileScanTask]) -> 
dict[str, list[ChunkedArray]]:

Review Comment:
   I was trying to say this is redundant with below rendering this line 
unreachable. But this is in th right direction for the follow up. I wanted to 
say that we aren't ignoring equality deletes we are throwing if a scan contains 
them
   
   
   
https://github.com/apache/iceberg-python/blob/3a7413a9de2346b4b50b004ab17d6ec4cce2da41/pyiceberg/table/__init__.py#L2059-L2075
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to