rdblue commented on a change in pull request #2603:
URL: https://github.com/apache/iceberg/pull/2603#discussion_r652149167
##########
File path: core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java
##########
@@ -314,16 +315,29 @@ protected void validateDataFilesExist(TableMetadata base,
Long startingSnapshotI
"Cannot determine history between starting snapshot %s and current
%s",
startingSnapshotId, currentSnapshotId);
+ // if current snapshot's parent was expired and starting snapshot is
null,
+ // add all manifests of current snapshot and break
+ Long parentId = currentSnapshot.parentId();
+ boolean shouldAddManifestsAndBreak =
+ startingSnapshotId == null && parentId != null &&
ops.current().snapshot(parentId) == null;
Review comment:
I think that the logic using `shouldAddManifestsAndBreak` is incorrect.
The situation this is trying to account for is when the entire table history
should be considered (`startingSnapshotId` is `null`), but only partial history
is available (`parentId` is not `null` but the snapshot is missing). I think
the idea here is that this can account for lost history by looking across all
manifests for a delete.
I don't think that this is implemented correctly for a few reasons:
* If the oldest available snapshot is an append snapshot, then the outer
check below for operation will prevent its manifests from being added as
expected.
* The manifest group that scans for deletes filters entries to just the
snapshot ids that were new since the starting id (`newSnapshots`). When history
is missing, that filter would need to be turned off or else the entries will be
ignored even if you scan older manifests.
* When reading the manifests to check for deletes, missing history could
have removed those deletes in a manifest rewrite. If I delete file A, it will
show up in a delete entry in a snapshot. But once I rewrite the manifest that
encoded that delete, the delete is no longer kept. So expiring history can
still cause a problem.
Because of that last flaw, I don't think that there is a way to do this
safely by searching for deletes. The only way to do this validation without
history is to scan the entire table for the required data files.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]