majian1998 opened a new pull request, #10325:
URL: https://github.com/apache/hudi/pull/10325
This PR is mainly intended to open a discussion on how to fix an existing
issue.
Assume a Hudi table has 4 instants that need to be archived, with timestamps
in ascending order (they are already sorted once instantToArchive has been
obtained): 1.deltacommit, 2.deltacommit, 3.deltacommit, and 4.deltacommit,
corresponding to the files a.parquet, b.parquet, c.parquet, and d.parquet,
respectively.
In the archiving code, the deletion of instants is handled by the following
code snippet:
```
if (!completedInstants.isEmpty()) {
  context.foreach(
      completedInstants,
      instant -> activeTimeline.deleteInstantFileIfExists(instant),
      Math.min(completedInstants.size(), config.getArchiveDeleteParallelism())
  );
}
```
Different instants are distributed across different threads for execution.
For instance, in Spark with a parallelism of 2, instants 1 and 2 might go to
one task and instants 3 and 4 to another. Consequently, instant 3 may be
deleted before instant 2. If instants 1 and 3 have been deleted while 2 and 4
have not, a query that obtains visibleCommitsAndCompactionTimeline at that
moment would see a timeline containing instants 2, 4, and so on, with instant
3 missing.
During that query, the data in c.parquet, which corresponds to instant 3,
would be completely invisible. I believe this is a serious problem, as users
could unknowingly retrieve incorrect data.
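The scenario above can be reproduced with a small, self-contained sketch. It does not use any Hudi classes: `TimelineGapDemo`, `visibleAfterDeleting`, and `hasGap` are hypothetical names introduced only for illustration, and the timeline is modeled as a plain sorted list of instant timestamps.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for the active timeline; not a Hudi class.
class TimelineGapDemo {

    // Returns the instants a reader would still see once `deleted` are gone.
    static List<String> visibleAfterDeleting(List<String> sortedInstants, Set<String> deleted) {
        List<String> visible = new ArrayList<>();
        for (String instant : sortedInstants) {
            if (!deleted.contains(instant)) {
                visible.add(instant);
            }
        }
        return visible;
    }

    // True if a deleted instant is newer than some surviving instant, i.e. the
    // deletions punched a hole in the timeline instead of trimming a prefix.
    static boolean hasGap(List<String> sortedInstants, Set<String> deleted) {
        boolean seenSurvivor = false;
        for (String instant : sortedInstants) {
            if (deleted.contains(instant)) {
                if (seenSurvivor) {
                    return true;
                }
            } else {
                seenSurvivor = true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> instants = List.of("1", "2", "3", "4");
        // Mid-archive state from the example: 1 and 3 deleted, 2 and 4 not yet.
        Set<String> midArchive = Set.of("1", "3");
        System.out.println("visible: " + visibleAfterDeleting(instants, midArchive));
        System.out.println("gap: " + hasGap(instants, midArchive));
    }
}
```

A reader that takes this snapshot sees instants 2 and 4 only, so the data written by instant 3 is skipped. If the deletions had instead removed a prefix such as 1 and 2, `hasGap` would return false and the remaining timeline would still be contiguous.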
Here are a few potential solutions I've considered:
1. Prohibit concurrent deletion of completed instant files. This guarantees
the order of deletions but could hurt performance, so it is not an optimal
solution. However, since there are usually only a few instants to remove,
serial deletion is an acceptable stopgap, and the change can be reverted once
a better solution is found.
2. Implement a solution similar to a marker file, recording which instants
are in the process of being deleted, and then remove these instants directly
from the timeline during reads.
3. Building on the second solution, integrate archiving into the timeline by
adding archive instants to it, so that pending archives can be retrieved
directly during data reads. Here I have a question: why don't existing
archive operations have a corresponding instant action?
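As a rough illustration of option 1, the sketch below deletes instants one at a time in ascending timestamp order and stops at the first failure, so the instants that remain always form a contiguous suffix of the timeline. This is not the code in this PR: `SerialArchiveDeleteSketch` and `deleteInOrder` are hypothetical names, the `Predicate` stands in for a call such as `activeTimeline.deleteInstantFileIfExists(instant)`, and the sketch assumes instant timestamps sort correctly as strings.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical helper for illustration only; not part of Hudi.
class SerialArchiveDeleteSketch {

    // Deletes instants oldest-first and returns the ones actually deleted.
    // `deleteInstantFile` is a placeholder that returns true on success.
    static List<String> deleteInOrder(List<String> completedInstants,
                                      Predicate<String> deleteInstantFile) {
        List<String> sorted = new ArrayList<>(completedInstants);
        // Assumption: timestamps are comparable as plain strings.
        sorted.sort(Comparator.naturalOrder());

        List<String> deleted = new ArrayList<>();
        for (String instant : sorted) {
            if (!deleteInstantFile.test(instant)) {
                // Stop here so that no newer instant is removed before this one.
                break;
            }
            deleted.add(instant);
        }
        return deleted;
    }

    public static void main(String[] args) {
        List<String> toArchive = List.of("3", "1", "4", "2");
        // Simulate a failure while deleting instant 3.
        System.out.println(deleteInOrder(toArchive, instant -> !instant.equals("3")));
    }
}
```

With this ordering, a concurrent reader can observe a shorter timeline but never one with a hole in the middle. The cost is that deletions no longer run in parallel, which is the performance trade-off described in option 1.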
### Change Logs
Delete completed instants serially.
### Impact
None
### Risk level (write none, low medium or high below)
Low
### Documentation Update
None
### Contributor's checklist
- [ ] Read through [contributor's
guide](https://hudi.apache.org/contribute/how-to-contribute)
- [ ] Change Logs and Impact were stated clearly
- [ ] Adequate tests were added if applicable
- [ ] CI passed