Reo-LEI commented on issue #3102: URL: https://github.com/apache/iceberg/issues/3102#issuecomment-919073751
> But if some of the files owned by the old snapshot are deleted by mistake.

@coolderli Thanks for your attention, but I don't think we need to worry about this case, because `IcebergFilesCommitter` only validates the data files that are referenced by the uncommitted pos-delete files. A pos-delete file only references data files from the same transaction. That means the referenced data files cannot be owned by another snapshot; they can only be uncommitted data files.

So I don't think your case will happen, but we can go a step further and discuss all possible situations. We assume that the table already has a historical snapshot.

**Case-1:** The Flink job **starts for the first time** and has **no** uncommitted data. `IcebergFilesCommitter` will do nothing, and `lastCommittedSnapshotId` will be initialized to its initial value (-1).

**Case-2:** The Flink job **restores** from a checkpoint and has **no** uncommitted data. `IcebergFilesCommitter` will do nothing, and `lastCommittedSnapshotId` will be initialized to its initial value (-1).

**Case-3:** The Flink job **restores** from a checkpoint and **has** uncommitted data. `IcebergFilesCommitter` will commit all uncommitted data. First, `lastCommittedSnapshotId` is initialized to its initial value (-1). Then the committer validates all data files referenced by the uncommitted pos-delete files, from the current snapshot back to `lastCommittedSnapshotId`. Because `lastCommittedSnapshotId` is -1, the committer walks the entire snapshot history to ensure that the data files still exist, which guarantees that the whole snapshot history is valid. After that, `lastCommittedSnapshotId` is updated to the committed snapshot ID.

**Case-4:** The Flink job keeps **running** and has **no** uncommitted data. `IcebergFilesCommitter` will do nothing, and `lastCommittedSnapshotId` will keep its value.

**Case-5:** The Flink job keeps **running** and **has** uncommitted data. `IcebergFilesCommitter` will commit all uncommitted data and validate all referenced data files from the current snapshot back to `lastCommittedSnapshotId`.
After that, `lastCommittedSnapshotId` is updated to the committed snapshot ID. This ensures that all referenced data files still exist and were not deleted between the two commits.
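
To make the difference between Case-3 and Case-5 concrete, here is a minimal, self-contained sketch of the "validate from the current snapshot back to `lastCommittedSnapshotId`" walk. It does not use the Iceberg API: `SnapshotValidationSketch`, `Snapshot`, `referencedFilesStillExist`, `exampleHistory` and the file name `data-a.parquet` are all hypothetical names made up for illustration, and the snapshot model (an ID, a parent ID, and the set of data files the snapshot removed) is a simplifying assumption.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch, not the real IcebergFilesCommitter or Iceberg API.
class SnapshotValidationSketch {
    // Initial value of lastCommittedSnapshotId, as described in the cases above.
    static final long INITIAL_SNAPSHOT_ID = -1L;

    // Simplified snapshot: its ID, its parent's ID (null for the oldest
    // snapshot), and the data files that this snapshot removed.
    static final class Snapshot {
        final long id;
        final Long parentId;
        final Set<String> removedDataFiles;

        Snapshot(long id, Long parentId, Set<String> removedDataFiles) {
            this.id = id;
            this.parentId = parentId;
            this.removedDataFiles = removedDataFiles;
        }
    }

    // Walks from the current snapshot back towards lastCommittedSnapshotId
    // (exclusive) and returns false if any visited snapshot removed a
    // referenced data file. With lastCommittedSnapshotId == -1 no snapshot
    // matches, so the whole history is visited.
    static boolean referencedFilesStillExist(Map<Long, Snapshot> history,
                                             long currentSnapshotId,
                                             long lastCommittedSnapshotId,
                                             Set<String> referencedDataFiles) {
        Long cursor = currentSnapshotId;
        while (cursor != null && cursor.longValue() != lastCommittedSnapshotId) {
            Snapshot snapshot = history.get(cursor);
            if (snapshot == null) {
                break; // ancestor not present in this history; nothing more to check
            }
            for (String removed : snapshot.removedDataFiles) {
                if (referencedDataFiles.contains(removed)) {
                    return false;
                }
            }
            cursor = snapshot.parentId;
        }
        return true;
    }

    // Three snapshots in a chain 1 <- 2 <- 3; snapshot 2 removed data-a.parquet.
    static Map<Long, Snapshot> exampleHistory() {
        Map<Long, Snapshot> history = new HashMap<>();
        history.put(1L, new Snapshot(1L, null, Collections.<String>emptySet()));
        history.put(2L, new Snapshot(2L, 1L, Collections.singleton("data-a.parquet")));
        history.put(3L, new Snapshot(3L, 2L, Collections.<String>emptySet()));
        return history;
    }

    public static void main(String[] args) {
        Map<Long, Snapshot> history = exampleHistory();
        Set<String> referenced = Collections.singleton("data-a.parquet");

        // Restore (Case-3): starts from -1, so snapshots 3, 2 and 1 are all checked.
        System.out.println(
            referencedFilesStillExist(history, 3L, INITIAL_SNAPSHOT_ID, referenced)); // false

        // Running job (Case-5): only snapshots newer than snapshot 2 are checked.
        System.out.println(
            referencedFilesStillExist(history, 3L, 2L, referenced)); // true
    }
}
```

In this toy history the restore path rejects the commit because it looks at every snapshot, while the running-job path only inspects the snapshots created since the last commit. In the real committer the referenced data files are the uncommitted ones from the same transaction, so the example file is only there to show which range of history gets visited.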
