linyanghao opened a new pull request, #7249: URL: https://github.com/apache/iceberg/pull/7249
When performing a RewriteDataFiles operation, Iceberg looks for position-delete files that were added after the rewrite's starting snapshot and checks whether they could contain deletes for the data files being rewritten. If they could, the rewrite operation fails.

Currently, the check relies on the lower and upper bounds of the DELETE_FILE_PATH field of the pos-delete records. A bounds check only shows that a rewritten file's path falls within the range, not that the delete file actually references it, so it can produce false positives and fail rewrites that have no real conflict. As a result, it becomes impossible to rewrite a table while it is being written to by streaming CDC (Change Data Capture).

To address this, this PR proposes adding a new snapshot property, "position-deletes-within-commit-only", which is set to "true" for CDC writes from Flink. The property indicates that the new pos-deletes refer only to data files added in the same commit, not to data files from earlier commits. When checking for conflicts during a rewrite, we can then skip the commits generated by Flink CDC writes.

This change should resolve the following issues:
https://github.com/apache/iceberg/issues/4996
https://github.com/apache/iceberg/issues/5397
https://github.com/apache/iceberg/issues/6330
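To illustrate the idea, here is a minimal, self-contained sketch of the two behaviors: a bounds-only check that reports a false positive, and the same check skipping commits that carry the proposed property. It does not use Iceberg's actual classes; `DeleteFile`, `Commit`, `boundsMayContain`, and `hasConflict` are hypothetical stand-ins invented for this example, and the file paths are made up. Only the property name comes from this PR.

```java
import java.util.List;
import java.util.Map;

public class PosDeleteConflictSketch {
    // Property name proposed in this PR.
    public static final String WITHIN_COMMIT_ONLY = "position-deletes-within-commit-only";

    // Hypothetical stand-in for a pos-delete file: only the lower and upper
    // bounds of the data file paths it references are known.
    public static final class DeleteFile {
        final String lowerPath;
        final String upperPath;

        public DeleteFile(String lowerPath, String upperPath) {
            this.lowerPath = lowerPath;
            this.upperPath = upperPath;
        }
    }

    // Hypothetical stand-in for a commit: its summary properties and the
    // pos-delete files it added.
    public static final class Commit {
        final Map<String, String> summary;
        final List<DeleteFile> newPosDeletes;

        public Commit(Map<String, String> summary, List<DeleteFile> newPosDeletes) {
            this.summary = summary;
            this.newPosDeletes = newPosDeletes;
        }
    }

    // Bounds-only test: true whenever the path sorts between the bounds,
    // even if the delete file never references that path.
    public static boolean boundsMayContain(DeleteFile file, String dataFilePath) {
        return file.lowerPath.compareTo(dataFilePath) <= 0
            && dataFilePath.compareTo(file.upperPath) <= 0;
    }

    // Scans commits made after the rewrite started. When honorProperty is
    // true, commits flagged as "within commit only" are skipped.
    public static boolean hasConflict(
            List<Commit> commitsSinceStart, List<String> rewrittenPaths, boolean honorProperty) {
        for (Commit commit : commitsSinceStart) {
            if (honorProperty && "true".equals(commit.summary.get(WITHIN_COMMIT_ONLY))) {
                continue;
            }
            for (DeleteFile deleteFile : commit.newPosDeletes) {
                for (String path : rewrittenPaths) {
                    if (boundsMayContain(deleteFile, path)) {
                        return true;
                    }
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // The CDC commit deletes rows only in its own two new data files,
        // but the rewritten file's path sorts between them.
        DeleteFile cdcDeletes = new DeleteFile("data/00001-new.parquet", "data/00009-new.parquet");
        Commit cdcCommit = new Commit(Map.of(WITHIN_COMMIT_ONLY, "true"), List.of(cdcDeletes));
        List<String> rewritten = List.of("data/00005.parquet");

        System.out.println("bounds only: " + hasConflict(List.of(cdcCommit), rewritten, false));
        System.out.println("with property: " + hasConflict(List.of(cdcCommit), rewritten, true));
    }
}
```

In this sketch, a commit without the property still goes through the bounds check, so pos-deletes from other writers would continue to block a conflicting rewrite.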
