linyanghao opened a new pull request, #7249:
URL: https://github.com/apache/iceberg/pull/7249

   When performing a RewriteDataFiles operation, if Iceberg finds new 
position-delete files that were produced after the starting snapshot of the 
rewrite, it checks whether these files could potentially contain deletes for 
the rewritten data files. If they do, then the rewrite operation fails.
   
   Currently, the check is based on the upper and lower bounds of the 
DELETE_FILE_PATH field of the pos-delete records. Because these bounds only 
describe a range of paths rather than the exact set of referenced data files, 
the check can produce false positives, causing rewrites to fail even when there 
are no actual conflicts. As a result, it becomes impossible to rewrite a table 
while it is being written to by streaming CDC (Change Data Capture).
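
   The false positive can be illustrated with a small sketch. This is not 
Iceberg code: the helper name and the file paths are hypothetical, and it only 
assumes that the bounds are the lexicographic minimum and maximum of the paths 
referenced by one pos-delete file.

```python
# Hypothetical illustration of a bounds-based conflict check (not Iceberg's API).

def may_contain_deletes_for(lower: str, upper: str, data_file_path: str) -> bool:
    """Return True if the path falls inside the [lower, upper] bounds."""
    return lower <= data_file_path <= upper

# Paths actually referenced by a new pos-delete file (made-up values).
referenced = [
    "s3://bucket/tbl/data/00001-new.parquet",
    "s3://bucket/tbl/data/00009-new.parquet",
]
lower, upper = min(referenced), max(referenced)

# A data file being rewritten; the pos-delete file never references it.
rewritten = "s3://bucket/tbl/data/00005-old.parquet"

bounds_conflict = may_contain_deletes_for(lower, upper, rewritten)
actual_conflict = rewritten in referenced

print(bounds_conflict, actual_conflict)
```

   Here the bounds check reports a possible conflict although the rewritten 
file is not among the referenced paths, which is the situation that makes the 
rewrite fail unnecessarily.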
   
   To address this issue, this PR proposes adding a new snapshot property, 
"position-deletes-within-commit-only", which is set to "true" on commits 
produced by Flink CDC writes. The property indicates that the new pos-deletes 
in a commit only refer to data files added in that same commit, not to data 
files from earlier commits. When checking for conflicts during rewrites, we can 
then skip the commits generated by Flink CDC writes.
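
   The intended skip logic can be sketched as follows. This is a simplified 
model under stated assumptions, not the change made in this PR: the Snapshot 
class, its fields, and both functions are hypothetical, and only the property 
name is taken from the description above.

```python
# Simplified model of skipping marked commits during rewrite validation.
# All names except the property key are hypothetical.
from dataclasses import dataclass, field

WITHIN_COMMIT_ONLY = "position-deletes-within-commit-only"

@dataclass
class Snapshot:
    snapshot_id: int
    summary: dict = field(default_factory=dict)
    # (lower, upper) path bounds of each new pos-delete file in the commit
    delete_file_bounds: list = field(default_factory=list)

def snapshots_to_validate(snapshots):
    """Drop commits whose pos-deletes only target files from the same commit."""
    return [s for s in snapshots if s.summary.get(WITHIN_COMMIT_ONLY) != "true"]

def rewrite_conflicts(snapshots, rewritten_paths):
    """Bounds-based check applied only to commits that are not skipped."""
    for snap in snapshots_to_validate(snapshots):
        for lower, upper in snap.delete_file_bounds:
            if any(lower <= path <= upper for path in rewritten_paths):
                return True
    return False

bounds = [("data/00001.parquet", "data/00009.parquet")]
cdc = Snapshot(1, {WITHIN_COMMIT_ONLY: "true"}, bounds)  # marked CDC commit
other = Snapshot(2, {}, bounds)                          # unmarked commit
rewritten = ["data/00005.parquet"]

print(rewrite_conflicts([cdc], rewritten))    # marked commit is skipped
print(rewrite_conflicts([other], rewritten))  # unmarked commit is still checked
```

   In this model, commits without the property are validated exactly as 
before, so the skip only applies to writers that explicitly set it.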
   
   By implementing this change, we can resolve the following issues:
   
   https://github.com/apache/iceberg/issues/4996
   https://github.com/apache/iceberg/issues/5397
   https://github.com/apache/iceberg/issues/6330


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

