hudi-agent commented on code in PR #18544:
URL: https://github.com/apache/hudi/pull/18544#discussion_r3212272767


##########
hudi-common/src/main/java/org/apache/hudi/metadata/HoodieTableMetadataUtil.java:
##########
@@ -2098,9 +2098,25 @@ public static Set<String> 
getValidInstantTimestamps(HoodieTableMetaClient dataMe
     // For any rollbacks and restores, we cannot neglect the instants that 
they are rolling back.
     // The rollback instant should be more recent than the start of the 
timeline for it to have rolled back any
     // instant which we have a log block for.
+    //
+    // Only read rollback metadata for rollbacks newer than the latest MDT 
compaction.
+    // After compaction, rolled-back log blocks are already merged into base 
files, so pre-compaction
+    // rollback timestamps are no longer needed for log block filtering. This 
avoids sequential storage
+    // reads for old rollback instants that can cause long latency during 
metadata table reading.
     final String earliestInstantTime = validInstantTimestamps.isEmpty() ? 
SOLO_COMMIT_TIMESTAMP : Collections.min(validInstantTimestamps);
+    final String latestMdtCompactionTime = 
metadataMetaClient.getActiveTimeline()
+        .getCommitTimeline()

Review Comment:
   🤖 The existing `HoodieBackedTableMetadata.getLatestCompactionTime()` (line 
808) uses `getCommitAndReplaceTimeline()` which also includes 
`REPLACE_COMMIT_ACTION` / `CLUSTERING_ACTION`, while this new code uses only 
`getCommitTimeline()` (just `COMMIT_ACTION`). Was this intentional? It's safe 
today since MDT only emits compaction commits as `COMMIT_ACTION`, but the 
inconsistency is a small future-proofing risk if MDT ever gains 
clustering/replace semantics — and reusing/sharing the existing helper would 
also avoid the duplicated lookup logic.
   
   <sub><i>- AI-generated; verify before applying. React 👍/👎 to flag 
quality.</i></sub>



##########
hudi-hadoop-common/src/test/java/org/apache/hudi/metadata/TestHoodieTableMetadataUtil.java:
##########
@@ -869,4 +878,89 @@ private static Stream<Arguments> 
mapKeyNoSeparatorToFileGroupIndexTestCases() {
         )
     );
   }
+
+  /**
+   * Tests getValidInstantTimestamps rollback handling:
+   * - Without MDT compaction, all rollback metadata is read (rolled-back 
commits appear in valid timestamps).
+   * - With MDT compaction, only post-compaction rollback metadata is read 
(pre-compaction rollbacks are skipped
+   *   because those log blocks are already merged into base files).
+   */
+  @Test
+  void testGetValidInstantTimestampsSkipsPreCompactionRollbacks() throws 
Exception {
+    HoodieTestTable testTable = HoodieTestTable.of(metaClient);
+
+    String commit1 = "20260101010101000";
+    String commit2 = "20260201010101000";
+    String commit3 = "20260301010101000";
+    String commit4 = "20260501010101000";
+    String commit5 = "20260601010101000";
+    testTable.addCommit(commit1);
+    testTable.addCommit(commit2);
+    testTable.addCommit(commit3);
+    testTable.addCommit(commit4);
+    testTable.addCommit(commit5);
+
+    // Rollbacks before MDT compaction time
+    addCompletedRollback(testTable, "20260202010101000", commit2);
+    addCompletedRollback(testTable, "20260302010101000", commit3);
+    // Rollback after MDT compaction time
+    addCompletedRollback(testTable, "20260502010101000", commit4);
+
+    // Delete rolled-back commit instants from the timeline to simulate real 
rollback behavior.
+    // In a real system, the commit instant file is removed when a rollback 
completes, so the
+    // only way these timestamps appear in validInstantTimestamps is via 
rollback metadata reading.
+    metaClient = HoodieTableMetaClient.reload(metaClient);
+    for (String rolledBack : Arrays.asList(commit2, commit3, commit4)) {
+      HoodieInstant completedCommit = metaClient.getInstantGenerator()
+          .createNewInstant(HoodieInstant.State.COMPLETED, 
HoodieTimeline.COMMIT_ACTION, rolledBack);
+      
metaClient.getActiveTimeline().deleteInstantFileIfExists(completedCommit);
+    }
+
+    // Create MDT metaClient with NO compaction initially (only delta commits)
+    String mdtBasePath = 
HoodieTableMetadata.getMetadataTableBasePath(basePath);
+    HoodieTestUtils.init(mdtBasePath, HoodieTableType.MERGE_ON_READ);
+    HoodieTableMetaClient mdtMetaClient = HoodieTableMetaClient.builder()
+        .setBasePath(mdtBasePath)
+        .build();
+    HoodieTestTable mdtTestTable = HoodieTestTable.of(mdtMetaClient);
+    mdtTestTable.addDeltaCommit("20260101020101000");
+
+    metaClient = HoodieTableMetaClient.reload(metaClient);
+    mdtMetaClient = HoodieTableMetaClient.reload(mdtMetaClient);
+
+    // Without MDT compaction, all rollback metadata is read — rolled-back 
commits appear
+    Set<String> validTimestamps = 
HoodieTableMetadataUtil.getValidInstantTimestamps(metaClient, mdtMetaClient);
+    assertTrue(validTimestamps.contains(commit1), "commit1 should be in valid 
timestamps");
+    assertTrue(validTimestamps.contains(commit2), "commit2 should be in valid 
timestamps (from rollback metadata read)");
+    assertTrue(validTimestamps.contains(commit3), "commit3 should be in valid 
timestamps (from rollback metadata read)");
+    assertTrue(validTimestamps.contains(commit4), "commit4 should be in valid 
timestamps (from rollback metadata read)");
+    assertTrue(validTimestamps.contains(commit5), "commit5 should be in valid 
timestamps");
+
+    // Now add a compaction commit to MDT at a time between rollback2 and 
rollback3
+    mdtTestTable.addCommit("20260401010101000");
+    mdtTestTable.addDeltaCommit("20260501020101000");
+
+    metaClient = HoodieTableMetaClient.reload(metaClient);
+    mdtMetaClient = HoodieTableMetaClient.reload(mdtMetaClient);
+
+    // With MDT compaction, only post-compaction rollback (commit4) metadata 
is read;
+    // pre-compaction rollbacks for commit2 and commit3 are skipped
+    validTimestamps = 
HoodieTableMetadataUtil.getValidInstantTimestamps(metaClient, mdtMetaClient);
+    assertTrue(validTimestamps.contains(commit1), "commit1 should be in valid 
timestamps");
+    assertTrue(validTimestamps.contains(commit5), "commit5 should be in valid 
timestamps");
+    assertTrue(validTimestamps.contains(commit4), "commit4 should be in valid 
timestamps (from post-compaction rollback)");
+    assertFalse(validTimestamps.contains(commit2), "commit2 should NOT be in 
valid timestamps (pre-compaction rollback skipped)");
+    assertFalse(validTimestamps.contains(commit3), "commit3 should NOT be in 
valid timestamps (pre-compaction rollback skipped)");
+  }
+
+  private void addCompletedRollback(HoodieTestTable testTable, String 
rollbackTime, String rolledBackCommit) throws Exception {

Review Comment:
   🤖 nit: `emptyPartitionFiles` reads as though the map is empty, but it 
actually contains a `partition1` entry. Something like `partitionFiles` or 
`partitionToFiles` might be less surprising.
   
   <sub><i>- AI-generated; verify before applying. React 👍/👎 to flag 
quality.</i></sub>



##########
hudi-hadoop-common/src/test/java/org/apache/hudi/metadata/TestHoodieTableMetadataUtil.java:
##########
@@ -869,4 +878,89 @@ private static Stream<Arguments> 
mapKeyNoSeparatorToFileGroupIndexTestCases() {
         )
     );
   }
+
+  /**
+   * Tests getValidInstantTimestamps rollback handling:
+   * - Without MDT compaction, all rollback metadata is read (rolled-back 
commits appear in valid timestamps).
+   * - With MDT compaction, only post-compaction rollback metadata is read 
(pre-compaction rollbacks are skipped
+   *   because those log blocks are already merged into base files).
+   */
+  @Test
+  void testGetValidInstantTimestampsSkipsPreCompactionRollbacks() throws 
Exception {
+    HoodieTestTable testTable = HoodieTestTable.of(metaClient);
+
+    String commit1 = "20260101010101000";
+    String commit2 = "20260201010101000";
+    String commit3 = "20260301010101000";
+    String commit4 = "20260501010101000";
+    String commit5 = "20260601010101000";
+    testTable.addCommit(commit1);
+    testTable.addCommit(commit2);
+    testTable.addCommit(commit3);
+    testTable.addCommit(commit4);
+    testTable.addCommit(commit5);
+
+    // Rollbacks before MDT compaction time
+    addCompletedRollback(testTable, "20260202010101000", commit2);
+    addCompletedRollback(testTable, "20260302010101000", commit3);
+    // Rollback after MDT compaction time
+    addCompletedRollback(testTable, "20260502010101000", commit4);
+
+    // Delete rolled-back commit instants from the timeline to simulate real 
rollback behavior.
+    // In a real system, the commit instant file is removed when a rollback 
completes, so the
+    // only way these timestamps appear in validInstantTimestamps is via 
rollback metadata reading.
+    metaClient = HoodieTableMetaClient.reload(metaClient);
+    for (String rolledBack : Arrays.asList(commit2, commit3, commit4)) {
+      HoodieInstant completedCommit = metaClient.getInstantGenerator()
+          .createNewInstant(HoodieInstant.State.COMPLETED, 
HoodieTimeline.COMMIT_ACTION, rolledBack);
+      
metaClient.getActiveTimeline().deleteInstantFileIfExists(completedCommit);
+    }
+
+    // Create MDT metaClient with NO compaction initially (only delta commits)
+    String mdtBasePath = 
HoodieTableMetadata.getMetadataTableBasePath(basePath);
+    HoodieTestUtils.init(mdtBasePath, HoodieTableType.MERGE_ON_READ);
+    HoodieTableMetaClient mdtMetaClient = HoodieTableMetaClient.builder()
+        .setBasePath(mdtBasePath)
+        .build();
+    HoodieTestTable mdtTestTable = HoodieTestTable.of(mdtMetaClient);
+    mdtTestTable.addDeltaCommit("20260101020101000");
+
+    metaClient = HoodieTableMetaClient.reload(metaClient);
+    mdtMetaClient = HoodieTableMetaClient.reload(mdtMetaClient);

Review Comment:
   🤖 nit: the comment is cut off mid-sentence — "rolled-back commits appear" 
appears to be missing the end of the thought. Could you complete it, e.g. 
"rolled-back commits appear in the valid timestamps"?
   
   <sub><i>- AI-generated; verify before applying. React 👍/👎 to flag 
quality.</i></sub>



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to