rambleraptor commented on PR #3320:
URL: https://github.com/apache/iceberg-python/pull/3320#issuecomment-4675718202

   Alright, thanks for hanging on. There's a lot of complicated logic here I'm 
trying to get through.
   
   `_resolve_parent_snapshot` only looks at the committed metadata, not at the 
commits that still need to happen. That means you get into this very fun 
situation if you have mixed deletes (a full-file delete and a partial delete 
in the same operation).
   
   Here's the test I wrote up:
   
   ```
   def test_mixed_delete_overwrite_retries_successfully(catalog: Catalog) -> None:
       """A mixed full-file + partial delete should succeed via retry, not raise ValidationException."""
       from pyiceberg.partitioning import PartitionField, PartitionSpec
       from pyiceberg.schema import Schema
       from pyiceberg.transforms import IdentityTransform
       from pyiceberg.types import LongType, NestedField, StringType
   
       catalog.create_namespace("default")
       schema = Schema(
           NestedField(1, "category", StringType(), required=False),
           NestedField(2, "value", LongType(), required=False),
       )
       spec = PartitionSpec(PartitionField(source_id=1, field_id=1000, transform=IdentityTransform(), name="category"))
       catalog.create_table("default.mixed_retry_test", schema=schema, partition_spec=spec)
   
       import pyarrow as pa
   
       tbl = catalog.load_table("default.mixed_retry_test")
       
       # 3 partitions, one data file each: a→[1,2], b→[3,4], c→[5,6]
       tbl.append(pa.table({"category": ["a", "a", "b", "b", "c", "c"], "value": [1, 2, 3, 4, 5, 6]}))
   
       tbl1 = catalog.load_table("default.mixed_retry_test")
       tbl2 = catalog.load_table("default.mixed_retry_test")
   
       
       tbl1.append(pa.table({"category": ["c"], "value": [7]}))
   
       # This is your problem.
       # This is in multiple partitions.
       # partition 'a' is a partial rewrite (a has 1,2 - we're only deleting 1), we get _OverwriteFiles
       # partition 'b' is a full rewrite (category == 'b'), we get _DeleteFiles
       tbl2.delete("value == 1 or category == 'b'")
   
       result = catalog.load_table("default.mixed_retry_test").scan().to_arrow()
       assert sorted(result.column("value").to_pylist()) == [2, 5, 6, 7]
   ```
   
   What would you think about creating some kind of `CommitWindow` class that 
tracks all of the commits that have been made since we attempted to commit? I'm 
hoping that would make it easier for us to understand the code.
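
   As a very rough sketch of that idea (everything here is hypothetical: `CommitWindow`, `SnapshotInfo`, and the method names do not exist in pyiceberg, and the snapshot IDs are made up), the window could track both the snapshots other writers committed after our base and the snapshots our own operation has staged but not yet committed, so the parent of the next snapshot is resolved from both:

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class SnapshotInfo:
    """Hypothetical stand-in for a snapshot; NOT pyiceberg's Snapshot class."""

    snapshot_id: int
    operation: str  # e.g. "append", "delete", "overwrite"


@dataclass
class CommitWindow:
    """Hypothetical helper: everything that happened since we started committing."""

    base_snapshot_id: Optional[int]
    # Snapshots committed by other writers after our base snapshot.
    concurrent: List[SnapshotInfo] = field(default_factory=list)
    # Snapshots produced by this operation that are not committed yet.
    pending: List[SnapshotInfo] = field(default_factory=list)

    def observe_concurrent(self, snapshot: SnapshotInfo) -> None:
        """Record a snapshot another writer committed after our base."""
        self.concurrent.append(snapshot)

    def stage(self, snapshot: SnapshotInfo) -> None:
        """Record a snapshot this operation produced but has not committed."""
        self.pending.append(snapshot)

    def parent_for_next(self) -> Optional[int]:
        """Parent for the next snapshot: our own pending work wins over
        concurrent commits, which win over the original base."""
        if self.pending:
            return self.pending[-1].snapshot_id
        if self.concurrent:
            return self.concurrent[-1].snapshot_id
        return self.base_snapshot_id

    def reset_pending(self) -> None:
        """Drop staged snapshots so they can be rebuilt on a retry."""
        self.pending.clear()


# Mirrors the test above: base snapshot 1, a concurrent append, then a mixed delete.
window = CommitWindow(base_snapshot_id=1)
window.observe_concurrent(SnapshotInfo(2, "append"))  # tbl1.append(...)
parent_of_delete = window.parent_for_next()  # the full-file delete of 'b'
window.stage(SnapshotInfo(3, "delete"))
parent_of_overwrite = window.parent_for_next()  # the partial rewrite of 'a'
```

In this sketch the second half of the mixed delete is parented on the first half rather than on the last committed snapshot, which is the case that looking only at committed metadata misses. How the real snapshot producers would feed such a window is left open.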


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

