Reo-LEI commented on issue #3102:
URL: https://github.com/apache/iceberg/issues/3102#issuecomment-919073751


   > But if some of the files owned by the old snapshot are deleted by mistake.
   
   @coolderli Thanks for your attention, but I don't think we need to worry 
about this case, because `IcebergFilesCommitter` only validates the data files 
that are referenced by uncommitted pos-delete files, and a pos-delete file 
only references data files written in the same transaction. That means the 
referenced data files are never owned by another snapshot; they can only be 
uncommitted data files.
   
   So I don't think your case will happen, but we can go a step further and 
discuss all possible situations. Assume that the table already has a 
historical snapshot.
   
   **Case-1:** The Flink job **starts for the first time** and has **no** 
uncommitted data.
   `IcebergFilesCommitter` will do nothing, and `lastCommittedSnapshotId` will 
be initialized to its initial value (-1).
   
   **Case-2:** The Flink job **restores** from a checkpoint and has **no** 
uncommitted data.
   `IcebergFilesCommitter` will do nothing, and `lastCommittedSnapshotId` will 
be initialized to its initial value (-1).
   
   **Case-3:** The Flink job **restores** from a checkpoint and **has** 
uncommitted data.
   `IcebergFilesCommitter` will commit all uncommitted data. First, 
`lastCommittedSnapshotId` is initialized to its initial value (-1). The 
committer then validates all data files that are referenced by uncommitted 
pos-delete files, checking from the current snapshot back to 
`lastCommittedSnapshotId`. Because `lastCommittedSnapshotId` is -1, the 
committer traverses the entire snapshot history to ensure that the data files 
still exist and that the whole snapshot history is valid. After that, 
`lastCommittedSnapshotId` is updated to the committed snapshot ID.
   
   **Case-4:** The Flink job keeps **running** and has **no** uncommitted data.
   `IcebergFilesCommitter` will do nothing, and `lastCommittedSnapshotId` will 
keep its value.
   
   **Case-5:** The Flink job keeps **running** and **has** uncommitted data.
   `IcebergFilesCommitter` will commit all uncommitted data and validate all 
referenced data files from the current snapshot back to 
`lastCommittedSnapshotId`. It then updates `lastCommittedSnapshotId` to the 
committed snapshot ID. This ensures that all referenced data files still exist 
and were not deleted between the two commits.
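
   To make the five cases concrete, here is a minimal Python model of the 
behavior described above. It is only a sketch, not the actual Java code in 
`IcebergFilesCommitter`: the names `Table`, `Snapshot`, `CommitterModel` and 
`removed_files` are hypothetical. It also assumes that "validate" means "no 
snapshot in the checked range removed a referenced data file", and that the 
range excludes `lastCommittedSnapshotId` itself.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set

INITIAL_SNAPSHOT_ID = -1  # sentinel: this committer instance has not committed yet


@dataclass
class Snapshot:
    snapshot_id: int
    parent_id: Optional[int]
    # Hypothetical field: data files that this snapshot removed from the table.
    removed_files: Set[str] = field(default_factory=set)


@dataclass
class Table:
    snapshots: Dict[int, Snapshot] = field(default_factory=dict)
    current_id: Optional[int] = None

    def append_snapshot(self, removed_files=()) -> int:
        """Add a new snapshot on top of the current one and return its id."""
        new_id = len(self.snapshots) + 1
        self.snapshots[new_id] = Snapshot(new_id, self.current_id, set(removed_files))
        self.current_id = new_id
        return new_id

    def snapshots_since(self, last_committed_id: int) -> List[Snapshot]:
        """Walk from the current snapshot back to last_committed_id (exclusive).

        No real snapshot has id -1, so the sentinel makes the walk cover
        the entire history (Case-3).
        """
        result = []
        cursor = self.current_id
        while cursor is not None and cursor != last_committed_id:
            snapshot = self.snapshots[cursor]
            result.append(snapshot)
            cursor = snapshot.parent_id
        return result


class CommitterModel:
    """Simplified stand-in for the committer. A new instance models a job
    start or a restore, where the id is reset to the initial value."""

    def __init__(self, table: Table):
        self.table = table
        self.last_committed_snapshot_id = INITIAL_SNAPSHOT_ID

    def commit(self, pending_files: Set[str], referenced_data_files: Set[str]) -> Optional[int]:
        # Cases 1, 2 and 4: nothing to commit, so the state is left untouched.
        if not pending_files:
            return None
        # Cases 3 and 5: check every snapshot in the range before committing.
        for snapshot in self.table.snapshots_since(self.last_committed_snapshot_id):
            missing = referenced_data_files & snapshot.removed_files
            if missing:
                raise ValueError(
                    f"snapshot {snapshot.snapshot_id} removed {sorted(missing)}"
                )
        new_id = self.table.append_snapshot()
        self.last_committed_snapshot_id = new_id
        return new_id
```

   In this model, a first `commit` after a restore checks the whole history 
because the id is still -1, while later commits only check the snapshots that 
other writers added since the previous commit. A failed validation raises 
before the id is updated, so the next attempt checks the same range again.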


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


