codope commented on code in PR #9871:
URL: https://github.com/apache/hudi/pull/9871#discussion_r1378785387


##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/timeline/HoodieTimelineArchiver.java:
##########
@@ -270,13 +270,41 @@ private List<HoodieInstant> getCommitInstantsToArchive() throws IOException {
             // stop at first savepoint commit
             return !firstSavepoint.isPresent() || compareTimestamps(s.getTimestamp(), LESSER_THAN, firstSavepoint.get().getTimestamp());
           }
-        }).filter(s -> earliestInstantToRetain
-            .map(instant -> compareTimestamps(s.getTimestamp(), LESSER_THAN, instant.getTimestamp()))
+        })
+        .filter(s -> earliestInstantToRetain.map(
+            instant -> compareTimestamps(s.getTimestamp(), LESSER_THAN, instant.getTimestamp())
+                // for the compaction c2 in metadata table triggered by commit c1 in data table,
+                // c2.getTimestamp() > c1.getTimestamp() because: c2 happens before c1 completes,
+                // and we are generating new instant time for c2 after c1 has started. Effectively,

Review Comment:
   Lines 208-222 are fine as they are trying to get the earliest candidate instant to retain. The main filtering happens after all such candidate instants have been collected. I think the fix should be where the filtering happens.
   
   > And this issue also exists for the master code, the compaction instant is delta_commit + “001” which happens before delta_commit with a greater instant time
   
   That is not the case. Let's say the data table had a commit 00000006, and then 00000007 happened, which triggered the compaction in the MDT. As per the master code, the compaction of the MDT is done first, and it takes the instant time 00000006001 (not 00000007001), where 00000006001 < 00000007. In this PR, we have changed the compaction instant time to be a newly generated instant time at that point, so it is always greater than the start time of the commit that triggered it.
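   The ordering argument above can be sketched as a small, self-contained example. This is an illustrative sketch, not the actual Hudi API: `lessThan` is a hypothetical helper standing in for `HoodieTimeline.compareTimestamps`, which (like this sketch) compares instant times lexicographically as strings.
   
   ```java
   // Sketch of why the master scheme's compaction instant sorts below the
   // triggering commit, while a freshly generated instant sorts above it.
   public class InstantOrderingSketch {
       // Hypothetical helper mirroring Hudi's string-based timestamp comparison.
       static boolean lessThan(String a, String b) {
           return a.compareTo(b) < 0;
       }
   
       public static void main(String[] args) {
           String lastCompleted = "00000006"; // last completed data-table commit
           String triggering = "00000007";    // commit that triggers MDT compaction
   
           // Master behavior: compaction instant = last completed commit + "001".
           String masterCompaction = lastCompleted + "001"; // "00000006001"
           System.out.println(lessThan(masterCompaction, triggering)); // true
   
           // PR behavior: a fresh instant time generated after the triggering
           // commit has started. "00000008" is a hypothetical example value;
           // any instant generated after 00000007 started sorts above it.
           String prCompaction = "00000008";
           System.out.println(lessThan(prCompaction, triggering)); // false
       }
   }
   ```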



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
