amogh-jahagirdar commented on code in PR #6090:
URL: https://github.com/apache/iceberg/pull/6090#discussion_r1058714858


##########
core/src/main/java/org/apache/iceberg/IncrementalFileCleanup.java:
##########
@@ -260,10 +259,14 @@ public void cleanFiles(TableMetadata beforeExpiration, 
TableMetadata afterExpira
         findFilesToDelete(manifestsToScan, manifestsToRevert, validIds, 
afterExpiration);
 
     deleteFiles(filesToDelete, "data");
-    LOG.warn("Manifests to delete: {}", Joiner.on(", 
").join(manifestsToDelete));
-    LOG.warn("Manifests Lists to delete: {}", Joiner.on(", 
").join(manifestListsToDelete));
     deleteFiles(manifestsToDelete, "manifest");
     deleteFiles(manifestListsToDelete, "manifest list");
+
+    if (!beforeExpiration.statisticsFiles().isEmpty()) {
+      Set<String> expiredStatisticsFilesLocations =
+          expiredStatisticsFilesLocations(beforeExpiration, afterExpiration);
+      deleteFiles(expiredStatisticsFilesLocations, "statistics files");
+    }

Review Comment:
   Nit: I think we could just have expireStatisticsFilesLocation short circuit 
before computing  in case before expiration is empty. Rather than have the 
if(!beforeExpiration.statisticsFiles().isEmpty()) in both cleanup 
implementations



##########
core/src/test/java/org/apache/iceberg/TestRemoveSnapshots.java:
##########
@@ -1234,6 +1243,95 @@ public void 
testMultipleRefsAndCleanExpiredFilesFailsForIncrementalCleanup() {
                 .commit());
   }
 
+  @Test
+  public void testExpireWithStatisticsFiles() throws IOException {
+    table.newAppend().appendFile(FILE_A).commit();
+    String statsFileLocation1 = statsFileLocation(table);
+    StatisticsFile statisticsFile1 =
+        writeStatsFileForCurrentSnapshot(
+            table.currentSnapshot().snapshotId(),
+            table.currentSnapshot().sequenceNumber(),
+            statsFileLocation1,
+            table.io());
+    commitStats(table.newTransaction(), statisticsFile1);
+    Assert.assertEquals(
+        "Must match the latest snapshot",
+        table.currentSnapshot().snapshotId(),
+        statisticsFile1.snapshotId());
+
+    table.newAppend().appendFile(FILE_B).commit();
+    String statsFileLocation2 = statsFileLocation(table);
+    StatisticsFile statisticsFile2 =
+        writeStatsFileForCurrentSnapshot(
+            table.currentSnapshot().snapshotId(),
+            table.currentSnapshot().sequenceNumber(),
+            statsFileLocation2,
+            table.io());
+    commitStats(table.newTransaction(), statisticsFile2);
+    Assert.assertEquals(
+        "Must match the latest snapshot",
+        table.currentSnapshot().snapshotId(),
+        statisticsFile2.snapshotId());
+
+    Assert.assertEquals("Should have 2 statistics file", 2, 
table.statisticsFiles().size());
+
+    long tAfterCommits = 
waitUntilAfter(table.currentSnapshot().timestampMillis());
+    removeSnapshots(table).expireOlderThan(tAfterCommits).commit();
+
+    Assert.assertEquals("Should keep 1 snapshot", 1, 
Iterables.size(table.snapshots()));
+    Assertions.assertThat(table.statisticsFiles())
+        .hasSize(1)
+        .extracting(StatisticsFile::snapshotId)
+        .as("Should contain only the statistics file of snapshot2")
+        .isEqualTo(Lists.newArrayList(statisticsFile2.snapshotId()));
+
+    Assertions.assertThat(new File(statsFileLocation1).exists()).isFalse();
+    Assertions.assertThat(new File(statsFileLocation2).exists()).isTrue();
+  }
+
+  @Test
+  public void testExpireWithStatisticsFilesWithReuse() throws IOException {
+    table.newAppend().appendFile(FILE_A).commit();
+    String statsFileLocation1 = statsFileLocation(table);
+    StatisticsFile statisticsFile1 =
+        writeStatsFileForCurrentSnapshot(
+            table.currentSnapshot().snapshotId(),
+            table.currentSnapshot().sequenceNumber(),
+            statsFileLocation1,
+            table.io());
+    commitStats(table.newTransaction(), statisticsFile1);
+    Assert.assertEquals(
+        "Must match the latest snapshot",
+        table.currentSnapshot().snapshotId(),
+        statisticsFile1.snapshotId());
+
+    table.newAppend().appendFile(FILE_B).commit();
+    // Note: In the real scenario, appendFile will always create a new 
statistics file.
+    // Only rewrite_data_files scenario can reuse the statistics file with new 
snapshot id.
+    // To avoid simulating rewrite, this test is reusing the stats for append 
operation itself.

Review Comment:
   Just a nit, I think this inline comment could be a bit more concise, maybe 
something like:
   
   "RewriteDataFiles can reuse statistics files across operations. This test 
reuses stats for append just to mimic this scenario without having to run 
RewriteDataFiles." 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to