rdblue commented on a change in pull request #2865:
URL: https://github.com/apache/iceberg/pull/2865#discussion_r677020238



##########
File path: core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java
##########
@@ -297,6 +283,39 @@ protected void validateAddedDataFiles(TableMetadata base, Long startingSnapshotI
     }
   }
 
+  /**
+   * Validates that no new delete files that must be applied to the given data files have been added to the table since
+   * a starting snapshot.
+   *
+   * @param base table metadata to validate
+   * @param startingSnapshotId id of the snapshot current at the start of the operation
+   * @param dataFiles data files to validate have no new row deletes
+   */
+  protected void validateNoNewDeletesForDataFiles(TableMetadata base, Long startingSnapshotId,
+                                                  Iterable<DataFile> dataFiles) {
+    // if there is no current table state, no files have been added
+    if (base.currentSnapshot() == null) {
+      return;
+    }
+
+    Pair<List<ManifestFile>, Set<Long>> history =
+        validationHistory(base, startingSnapshotId, VALIDATE_REPLACED_DATA_FILES_OPERATIONS, ManifestContent.DELETES);
+    List<ManifestFile> deleteManifests = history.first();
+
+    long startingSequenceNumber = startingSnapshotId == null ? 0 : base.snapshot(startingSnapshotId).sequenceNumber();
+    DeleteFileIndex deletes = DeleteFileIndex.builderFor(ops.io(), deleteManifests)
+        .afterSequenceNumber(startingSequenceNumber)
+        .specsById(ops.current().specsById())
+        .build();
+
+    for (DataFile dataFile : dataFiles) {

Review comment:
       The data files here are the ones held in memory because they are 
being replaced. I don't think this will be a significant slowdown, since 
we're just doing an index check, but we can follow up with a couple of 
improvements to make it faster.
   
   One improvement I'd opt for before avoiding this loop is to read only the 
manifest files that were created in the new snapshots, that is, those where 
the snapshot ID of the delete file is one of the snapshots newer than the 
starting snapshot. We don't do that currently because the delete file index 
builder doesn't support it, and updating the index builder looked more 
invasive (and we're trying to get 0.12.0 out).
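
   To make the suggested filter concrete, here is a minimal, self-contained 
sketch of the idea: keep only the entries whose snapshot ID is in the set of 
snapshots newer than the starting snapshot. `Entry`, `NewSnapshotFilter`, and 
`addedInNewSnapshots` are hypothetical names made up for illustration; they 
are not Iceberg classes, and this is not how the delete file index builder 
works today.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class NewSnapshotFilter {
  // Hypothetical stand-in for a manifest or delete file entry; not an Iceberg class.
  static final class Entry {
    final String path;
    final long snapshotId;

    Entry(String path, long snapshotId) {
      this.path = path;
      this.snapshotId = snapshotId;
    }
  }

  // Keeps only the entries that were added by one of the new snapshots.
  static List<Entry> addedInNewSnapshots(List<Entry> entries, Set<Long> newSnapshotIds) {
    List<Entry> result = new ArrayList<>();
    for (Entry entry : entries) {
      if (newSnapshotIds.contains(entry.snapshotId)) {
        result.add(entry);
      }
    }
    return result;
  }

  public static void main(String[] args) {
    List<Entry> entries = Arrays.asList(
        new Entry("deletes-1.avro", 10L),   // written before the starting snapshot
        new Entry("deletes-2.avro", 20L),   // written by a newer snapshot
        new Entry("deletes-3.avro", 30L));  // written by a newer snapshot

    // Assumed: IDs of the snapshots committed after the starting snapshot.
    Set<Long> newSnapshotIds = new HashSet<>(Arrays.asList(20L, 30L));

    for (Entry entry : addedInNewSnapshots(entries, newSnapshotIds)) {
      System.out.println(entry.path);
    }
  }
}
```

   In the method under review, the set of new snapshot IDs would presumably 
come from the validation history, so only entries written after the starting 
snapshot would need to be read and indexed.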




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


