rdblue commented on a change in pull request #2603:
URL: https://github.com/apache/iceberg/pull/2603#discussion_r652149167



##########
File path: core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java
##########
@@ -314,16 +315,29 @@ protected void validateDataFilesExist(TableMetadata base, 
Long startingSnapshotI
           "Cannot determine history between starting snapshot %s and current 
%s",
           startingSnapshotId, currentSnapshotId);
 
+      // if current snapshot's parent was expired and starting snapshot is 
null,
+      // add all manifests of current snapshot and break
+      Long parentId = currentSnapshot.parentId();
+      boolean shouldAddManifestsAndBreak =
+          startingSnapshotId == null && parentId != null && 
ops.current().snapshot(parentId) == null;

Review comment:
       I think that the logic using `shouldAddManifestsAndBreak` is incorrect. 
The situation this is trying to account for is when the entire table history 
should be considered (`startingSnapshotId` is `null`), but only partial history 
is available (`parentId` is not `null` but the snapshot is missing). I think 
the idea here is that this can account for lost history by looking across all 
manifests for a delete.
   
   I don't think that this is implemented correctly for a few reasons:
   * If the oldest available snapshot is an append snapshot, then the outer 
check below for operation will prevent its manifests from being added as 
expected.
   * The manifest group that scans for deletes filters entries to just the 
snapshot ids that were new since the starting id (`newSnapshots`). When history 
is missing, that filter would need to be turned off or else the entries will be 
ignored even if you scan older manifests.
   * When reading the manifests to check for deletes, missing history could 
have removed those deletes in a manifest rewrite. If I delete file A, it will 
show up in a delete entry in a snapshot. But once I rewrite the manifest that 
encoded that delete, the delete is no longer kept. So expiring history can 
still cause a problem.
   
   Because of that last flaw, I don't think that there is a way to do this 
safely by searching for deletes. The only way to do this validation without 
history is to scan the entire table for the required data files.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to