voonhous commented on code in PR #18544:
URL: https://github.com/apache/hudi/pull/18544#discussion_r3246015116


##########
hudi-common/src/main/java/org/apache/hudi/metadata/HoodieTableMetadataUtil.java:
##########
@@ -2098,9 +2098,25 @@ public static Set<String> getValidInstantTimestamps(HoodieTableMetaClient dataMe
     // For any rollbacks and restores, we cannot neglect the instants that they are rolling back.
     // The rollback instant should be more recent than the start of the timeline for it to have rolled back any
     // instant which we have a log block for.
+    //
+    // Only read rollback metadata for rollbacks newer than the latest MDT compaction.
+    // After compaction, rolled-back log blocks are already merged into base files, so pre-compaction
+    // rollback timestamps are no longer needed for log block filtering. This avoids sequential storage
+    // reads for old rollback instants that can cause long latency during metadata table reading.
     final String earliestInstantTime = validInstantTimestamps.isEmpty() ? SOLO_COMMIT_TIMESTAMP : Collections.min(validInstantTimestamps);
+    final String latestMdtCompactionTime = metadataMetaClient.getActiveTimeline()
+        .getCommitTimeline()

Review Comment:
   Building on top of this: if anything else ever writes a **COMMIT_ACTION** to 
MDT, this would silently treat that timestamp as a "compaction". It's worth 
being defensive here; consider filtering explicitly on the compaction action.
   
   IIRC, **COMMIT_ACTION** writes to MDT are generated exclusively by 
compaction, so this is safe for now. 
   
   The only problem that may arise in the future is a change in the API 
contract that breaks this assumption, at which point this becomes a regression. 
   
   As of now, I don't think this should be a blocker; I just want to highlight 
it.
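
To make the concern concrete, here is a rough, self-contained sketch. It deliberately does not use the real Hudi classes: `Instant`, its `fromCompaction` flag, and both helper methods are hypothetical stand-ins, and the action strings are only illustrative. It also assumes instant timestamps are fixed-width strings, so lexicographic order matches chronological order.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

class MdtCompactionLookup {

  // Hypothetical stand-in for a timeline instant; NOT Hudi's HoodieInstant.
  static final class Instant {
    final String timestamp;       // assumed fixed-width, so string order == time order
    final String action;          // illustrative action name, e.g. "commit" or "deltacommit"
    final boolean fromCompaction; // hypothetical marker: true if compaction produced this write

    Instant(String timestamp, String action, boolean fromCompaction) {
      this.timestamp = timestamp;
      this.action = action;
      this.fromCompaction = fromCompaction;
    }
  }

  // Implicit assumption: every "commit" on the MDT timeline is a compaction.
  static Optional<String> latestCommitTime(List<Instant> timeline) {
    return timeline.stream()
        .filter(i -> "commit".equals(i.action))
        .map(i -> i.timestamp)
        .max(String::compareTo);
  }

  // Defensive variant: only count commits that are known to come from compaction.
  static Optional<String> latestCompactionTime(List<Instant> timeline) {
    return timeline.stream()
        .filter(i -> "commit".equals(i.action) && i.fromCompaction)
        .map(i -> i.timestamp)
        .max(String::compareTo);
  }

  public static void main(String[] args) {
    List<Instant> timeline = Arrays.asList(
        new Instant("001", "deltacommit", false),
        new Instant("002", "commit", true),    // real compaction
        new Instant("003", "deltacommit", false),
        new Instant("005", "commit", false));  // hypothetical future non-compaction commit

    System.out.println(latestCommitTime(timeline));     // prints Optional[005]
    System.out.println(latestCompactionTime(timeline)); // prints Optional[002]
  }
}
```

With the non-compaction commit at `005` in the timeline, the unfiltered lookup moves the cutoff forward to `005`, so rollbacks between `002` and `005` would be skipped even though no compaction has merged their log blocks. The defensive variant keeps the cutoff at `002`.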


