yihua commented on code in PR #10325:
URL: https://github.com/apache/hudi/pull/10325#discussion_r1426815873


##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/timeline/HoodieTimelineArchiver.java:
##########
@@ -342,11 +342,12 @@ private boolean deleteArchivedInstants(List<ActiveAction> 
activeActions, HoodieE
       );
     }
     if (!completedInstants.isEmpty()) {
-      context.foreach(
-          completedInstants,
-          instant -> activeTimeline.deleteInstantFileIfExists(instant),
-          Math.min(completedInstants.size(), 
config.getArchiveDeleteParallelism())
-      );
+      // Due to the concurrency between deleting completed instants and 
reading data,
+      // there may be hole in the timeline, which can lead to errors when 
reading data.
+      // Therefore, the concurrency of deleting completed instants is 
temporarily disabled,
+      // and instants are deleted in ascending order to prevent the occurrence 
of such holes.
+      completedInstants.stream()
+          .forEach(instant -> 
activeTimeline.deleteInstantFileIfExists(instant));
     }

Review Comment:
   @majian1998 This is a good find.  The issue you mentioned is legit, though 
it can only happen during the parallelized deletion of instant files in the 
active timeline.
   
   @danny0405 yeah, you pointed the right method.  The file group instance uses 
the active timeline loaded to determine if a file slice is committed, if either 
is true: (1) the active timeline contains the instant time, or (2) the instant 
time is smaller than the start of the active timeline.
   
   In the case @majian1998 mentions, i.e., `3.deltacommit` is deleted first 
from the active timeline `1.deltacommit, 2.deltacommit, and 4.deltacommit`, the 
data files written by `3.deltacommit` are considered non-committed in such a 
transient state, since neither condition of the above is true.
   
   So, the instant files have to be deleted in order based on the instant time 
(not the case as of master).  Within the same instant time, the files should be 
deleted in the order of `REQUESTED`, `INFLIGHT`, `COMPLETED` (`<ts>.commit` 
etc. should be deleted last, which is the case now).
   
   After looking at the history, the deletion did happen in order before until 
#3920 introduces the parallelized deletion.  We need to revert that.  In normal 
cases where archival runs for every commit, usually a handful of instant files 
are deleted so the performance hit may not be noticeable.
   
   ```
     /**
      * A FileSlice is considered committed, if one of the following is true - 
There is a committed data file - There are
      * some log files, that are based off a commit or delta commit.
      */
     private boolean isFileSliceCommitted(FileSlice slice) {
       if (!compareTimestamps(slice.getBaseInstantTime(), 
LESSER_THAN_OR_EQUALS, lastInstant.get().getTimestamp())) {
         return false;
       }
   
       return 
timeline.containsOrBeforeTimelineStarts(slice.getBaseInstantTime());
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to