mxm commented on PR #14182: URL: https://github.com/apache/iceberg/pull/14182#issuecomment-3337290281
>@mxm Even with this change, the data loss will still occur for WriteResults with delete files in the scenario described in https://github.com/apache/iceberg/issues/14090. For example, consider the case when the DynamicCommitter fails after the first committed RowDelta in the private void commitDeltaTxn() method: I have to politely disagree with you here. We commit in two cases: 1. Whenever a checkpoint contains delete files (only exception, we haven't yet processed any checkpoints) 2. At the end of processing all checkpoints and their WriteResults Since the smallest unit at which we do a table snapshot is per Flink checkpoint, we will always be able to recover the commit state by looking up the highest committed checkpoint id from the table summary which is kept per table/branch. >The complete solution is to aggregate all WriteResults for a (checkpoint, table, branch) triplet, which I implemented in https://github.com/apache/iceberg/pull/14092 in DynamicWriteResultAggregator. It is valid to aggregate delete files from WriteResults within a single checkpoint because all changes within a checkpoint are logically concurrent and get the same sequence number when committed. This is precisely what this change does. There is an additional optimization to also optimize multiple commits. I think this makes the code hard to review. I'm going to revert this change and only commit each checkpoint. This makes the code easier to reason about. Also, the situation where we have WriteResults from multiple Flink checkpoints is very rare to occur. >Combining WriteResults across checkpoints for appends is a different story. It is valid to do this in DynamicCommitter because it is the only logical place in the code that has the context across multiple checkpoints, while DynamicWriteResultAggregator always operates within a single checkpoint. For the sake of simplicity, I will revert the change to combine WriteResults from multiple Flink checkpoints. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org