rdblue commented on issue #2482: https://github.com/apache/iceberg/issues/2482#issuecomment-897017924
@ayush-san, let me make sure I understand. Say a table has snapshots 1, 2, 3, 4, and 5, and a Flink streaming process has consumed through snapshot 2. Then someone expires snapshots and removes 1, 2, and 3, so the table only has snapshots 4 and 5. The validation that is failing looks at 5, then 4, then sees that there is no snapshot 3. Because it can't provide the rows appended in 3, it fails.

It sounds like you're suggesting that we could keep track of the oldest snapshot in the table and use that for `startingSnapshotId`. So instead of the Flink job starting from snapshot 2 and reading snapshots 3, 4, and 5, it would start from snapshot 3 and read only 4 and 5. The problem is that this automatically skips the rows from snapshot 3, which isn't correct.

It also sounds like you're suggesting that we annotate a snapshot, which I don't think would be necessary because we could always traverse parents from the current table state and only process the snapshots that are available.

What the committer is doing is different. The committer keeps the checkpoint ID in the snapshot summary so that it can check whether a given checkpoint has already been committed to the table. That makes commits exactly-once as long as the snapshot history is longer than the time the application was unable to commit.
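To make the failing validation concrete, here is a rough sketch of the ancestor walk described above. It is not Iceberg's actual validation code; the helper name and exception messages are illustrative, and only `Table.currentSnapshot()`, `Table.snapshot(long)`, and `Snapshot.parentId()` are real API calls:

```java
import org.apache.iceberg.Snapshot;
import org.apache.iceberg.Table;

public class IncrementalScanCheck {

  // Walk parent pointers from the current snapshot back to startingSnapshotId.
  // In the example above this visits 5, then 4, then fails because snapshot 3
  // was expired: the scan cannot return the rows appended in 3, so it refuses
  // to silently skip them.
  static void validateStartingSnapshot(Table table, long startingSnapshotId) {
    Long current = table.currentSnapshot().snapshotId();
    while (current != null && current != startingSnapshotId) {
      Snapshot snapshot = table.snapshot(current);
      if (snapshot == null) {
        throw new IllegalStateException(
            "Cannot read incrementally: snapshot " + current + " has been expired");
      }
      current = snapshot.parentId(); // null once we pass the oldest known ancestor
    }
    if (current == null) {
      throw new IllegalStateException(
          "Snapshot " + startingSnapshotId + " is not an ancestor of the current snapshot");
    }
  }
}
```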

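And a minimal sketch of the committer behavior described in the last paragraph, assuming a summary property similar to the one the Flink sink uses (the key name, class, and helpers here are illustrative, not the real `IcebergFilesCommitter` logic):

```java
import java.util.Map;

import org.apache.iceberg.AppendFiles;
import org.apache.iceberg.DataFile;
import org.apache.iceberg.Snapshot;
import org.apache.iceberg.Table;

public class CheckpointCommitSketch {
  // Illustrative summary key; the real Flink sink stores a similar property.
  static final String MAX_COMMITTED_CHECKPOINT_ID = "flink.max-committed-checkpoint-id";

  // Commit a checkpoint's files and record the checkpoint ID in the snapshot
  // summary so a restarted job can tell this commit already succeeded.
  static void commit(Table table, Iterable<DataFile> files, long checkpointId) {
    AppendFiles append = table.newAppend();
    files.forEach(append::appendFile);
    append.set(MAX_COMMITTED_CHECKPOINT_ID, Long.toString(checkpointId));
    append.commit();
  }

  // On restore, scan the snapshot summaries for the highest committed
  // checkpoint ID; checkpoints at or below this value can be skipped,
  // which is what makes the commits exactly-once.
  static long maxCommittedCheckpointId(Table table) {
    long max = -1L;
    for (Snapshot snapshot : table.snapshots()) {
      Map<String, String> summary = snapshot.summary();
      String value = summary == null ? null : summary.get(MAX_COMMITTED_CHECKPOINT_ID);
      if (value != null) {
        max = Math.max(max, Long.parseLong(value));
      }
    }
    return max;
  }
}
```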