rdblue commented on a change in pull request #2865:
URL: https://github.com/apache/iceberg/pull/2865#discussion_r677020238
##########
File path: core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java
##########
@@ -297,6 +283,39 @@ protected void validateAddedDataFiles(TableMetadata base, Long startingSnapshotI
     }
   }
+  /**
+   * Validates that no new delete files that must be applied to the given data files have been added to the table since
+   * a starting snapshot.
+   *
+   * @param base table metadata to validate
+   * @param startingSnapshotId id of the snapshot current at the start of the operation
+   * @param dataFiles data files to validate have no new row deletes
+   */
+  protected void validateNoNewDeletesForDataFiles(TableMetadata base, Long startingSnapshotId,
+                                                  Iterable<DataFile> dataFiles) {
+    // if there is no current table state, no files have been added
+    if (base.currentSnapshot() == null) {
+      return;
+    }
+
+    Pair<List<ManifestFile>, Set<Long>> history =
+        validationHistory(base, startingSnapshotId, VALIDATE_REPLACED_DATA_FILES_OPERATIONS, ManifestContent.DELETES);
+    List<ManifestFile> deleteManifests = history.first();
+
+    long startingSequenceNumber = startingSnapshotId == null ? 0 : base.snapshot(startingSnapshotId).sequenceNumber();
+    DeleteFileIndex deletes = DeleteFileIndex.builderFor(ops.io(), deleteManifests)
+        .afterSequenceNumber(startingSequenceNumber)
+        .specsById(ops.current().specsById())
+        .build();
+
+    for (DataFile dataFile : dataFiles) {
Review comment:
The data files here are the ones that are in memory because they are
being replaced. I don't think this is going to be a significant slowdown
since we're just doing an index check, but we can follow up with a couple of
improvements to make it faster.
One improvement I'd opt for before avoiding this loop is to only read
manifest files that were created in the new snapshots, that is, only when the
snapshot ID of the delete file is one of the snapshots newer than the starting
snapshot. We don't currently do that because the delete file index builder
doesn't support it, and updating the index builder looked more invasive (and
we're trying to get 0.12.0 out).
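A rough sketch of that follow-up filter, only to illustrate the idea. `NewDeleteFilter`, `DeleteEntry`, and `addedIn` are hypothetical stand-ins, not Iceberg classes or the index builder's API. It assumes each delete file entry records the ID of the snapshot that added it, and that the set of snapshot IDs newer than the starting snapshot is already known.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class NewDeleteFilter {
  // Hypothetical stand-in for a delete file entry; not an Iceberg class.
  public static final class DeleteEntry {
    public final String path;
    public final long snapshotId; // assumed: ID of the snapshot that added this delete file

    public DeleteEntry(String path, long snapshotId) {
      this.path = path;
      this.snapshotId = snapshotId;
    }
  }

  // Keep only the entries added by one of the snapshots newer than the starting snapshot.
  public static List<DeleteEntry> addedIn(Set<Long> newSnapshotIds, Iterable<DeleteEntry> entries) {
    List<DeleteEntry> result = new ArrayList<>();
    for (DeleteEntry entry : entries) {
      if (newSnapshotIds.contains(entry.snapshotId)) {
        result.add(entry);
      }
    }
    return result;
  }

  public static void main(String[] args) {
    // Made-up IDs: snapshots 11 and 12 were committed after the starting snapshot.
    Set<Long> newSnapshotIds = new HashSet<>(Arrays.asList(11L, 12L));
    List<DeleteEntry> entries = Arrays.asList(
        new DeleteEntry("deletes-a.parquet", 10L),
        new DeleteEntry("deletes-b.parquet", 11L),
        new DeleteEntry("deletes-c.parquet", 12L));

    for (DeleteEntry entry : addedIn(newSnapshotIds, entries)) {
      System.out.println(entry.path);
    }
  }
}
```

With a filter like this applied first, only delete files from the new snapshots would need to go into the index, so the per-data-file check would run against a smaller index.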
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]