yihua commented on code in PR #10325:
URL: https://github.com/apache/hudi/pull/10325#discussion_r1426815873
##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/timeline/HoodieTimelineArchiver.java:
##########
@@ -342,11 +342,12 @@ private boolean deleteArchivedInstants(List<ActiveAction>
activeActions, HoodieE
);
}
if (!completedInstants.isEmpty()) {
- context.foreach(
- completedInstants,
- instant -> activeTimeline.deleteInstantFileIfExists(instant),
- Math.min(completedInstants.size(),
config.getArchiveDeleteParallelism())
- );
+ // Due to the concurrency between deleting completed instants and
reading data,
+ // there may be hole in the timeline, which can lead to errors when
reading data.
+ // Therefore, the concurrency of deleting completed instants is
temporarily disabled,
+ // and instants are deleted in ascending order to prevent the occurrence
of such holes.
+ completedInstants.stream()
+ .forEach(instant ->
activeTimeline.deleteInstantFileIfExists(instant));
}
Review Comment:
@majian1998 This is a good find. The issue you mentioned is legit, though
it can only happen during the parallelized deletion of instant files in the
active timeline.
@danny0405 yeah, you pointed the right method. The file group instance uses
the active timeline loaded to determine if a file slice is committed, if either
is true: (1) the active timeline contains the instant time, or (2) the instant
time is smaller than the start of the active timeline.
```
/**
* A FileSlice is considered committed, if one of the following is true -
There is a committed data file - There are
* some log files, that are based off a commit or delta commit.
*/
private boolean isFileSliceCommitted(FileSlice slice) {
if (!compareTimestamps(slice.getBaseInstantTime(),
LESSER_THAN_OR_EQUALS, lastInstant.get().getTimestamp())) {
return false;
}
return
timeline.containsOrBeforeTimelineStarts(slice.getBaseInstantTime());
```
In the case @majian1998 mentions, i.e., `3.deltacommit` is deleted first
from the active timeline `1.deltacommit, 2.deltacommit, and 4.deltacommit`, the
data files written by `3.deltacommit` are considered non-committed in such a
transient state, since neither condition of the above is true.
So, the instant files have to be deleted in order based on the instant time
(not the case as of master). Within the same instant time, the files should be
deleted in the order of `REQUESTED`, `INFLIGHT`, `COMPLETED` (`<ts>.commit`
etc. should be deleted last, which is the case now).
After looking at the history, the deletion did happen in order before until
#3920 introduces the parallelized deletion. We need to revert that. In normal
cases where archival runs for every commit, usually a handful of instant files
are deleted so the performance hit may not be noticeable.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]