Ma Jian created HUDI-7207:
-----------------------------
Summary: Concurrent archiving and data reading leads to missing
data in query results.
Key: HUDI-7207
URL: https://issues.apache.org/jira/browse/HUDI-7207
Project: Apache Hudi
Issue Type: Bug
Reporter: Ma Jian
Assume a Hudi table has 4 instants that need to be archived, with timestamps
in ascending order (they are sorted after {{instantToArchive}} is obtained):
1.deltacommit, 2.deltacommit, 3.deltacommit, and 4.deltacommit, corresponding
to the files a.parquet, b.parquet, c.parquet, and d.parquet, respectively.
In the archiving code, the deletion of instants is handled by the following
code snippet:
{code:java}
if (!completedInstants.isEmpty()) {
  context.foreach(
      completedInstants,
      instant -> activeTimeline.deleteInstantFileIfExists(instant),
      Math.min(completedInstants.size(), config.getArchiveDeleteParallelism())
  );
}
{code}
Different instants are distributed across different threads for execution. For
instance, in Spark with a parallelism of 2, they might be split into the groups
(1, 2) and (3, 4). Deletions therefore do not follow timestamp order, and
instant 3 may be deleted before instant 2. If instants 1 and 3 have been deleted
while 2 and 4 have not, a query that obtains visibleCommitsAndCompactionTimeline
at that moment would find a timeline with instants 2, 4, and so on.
For that query, the data in c.parquet, which corresponds to instant 3, would be
completely invisible. I believe this is a serious problem, because users could
retrieve incorrect results without knowing it.
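To make the gap concrete, here is a minimal, self-contained sketch. It is not Hudi code: the class TimelineGapDemo, the method visibleFiles, and the visibility rule itself (a file is readable if its instant is on the active timeline or is older than the first active instant) are simplifying assumptions used only to model the scenario above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical model of the race; none of these names exist in Hudi.
final class TimelineGapDemo {

    // Assumed, simplified visibility rule: a file is readable if its instant is
    // still on the active timeline, or if the instant is older than the first
    // active instant (and is therefore treated as already archived).
    static List<String> visibleFiles(TreeSet<Integer> activeTimeline,
                                     TreeMap<Integer, String> fileByInstant) {
        List<String> visible = new ArrayList<>();
        for (Map.Entry<Integer, String> entry : fileByInstant.entrySet()) {
            int instant = entry.getKey();
            boolean beforeTimelineStart =
                activeTimeline.isEmpty() || instant < activeTimeline.first();
            if (activeTimeline.contains(instant) || beforeTimelineStart) {
                visible.add(entry.getValue());
            }
        }
        return visible;
    }

    public static void main(String[] args) {
        TreeMap<Integer, String> files = new TreeMap<>();
        files.put(1, "a.parquet");
        files.put(2, "b.parquet");
        files.put(3, "c.parquet");
        files.put(4, "d.parquet");

        // Snapshot taken after instants 1 and 3 were deleted, but before 2 and 4.
        TreeSet<Integer> timeline = new TreeSet<>(files.keySet());
        timeline.remove(1);
        timeline.remove(3);

        // Instant 3 is neither on the timeline nor older than its start (2),
        // so c.parquet is dropped from the result.
        System.out.println(visibleFiles(timeline, files));
    }
}
```

Under this model a.parquet stays readable because instant 1 is older than the start of the timeline, while c.parquet falls into the hole between instants 2 and 4.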
Here are a few potential solutions I've considered:
1. Prohibit concurrent deletion of completed instant files. This would
guarantee the order of deletions, but it could significantly hurt performance,
so it is not an optimal solution.
2. Implement something similar to a marker file that records which instants
are in the process of being deleted, and then exclude these instants from the
timeline during reads.
3. Building on the second solution, model archiving as an action by adding
archive instants to the timeline, so that pending archives can be retrieved
directly during data reads. Here I have a question: why have archive operations
so far not had a corresponding instant action?
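
As a rough illustration of the second idea, the sketch below filters the instants listed in a pending-deletion marker out of whatever the reader finds on storage. The class PendingArchiveFilter and the method readerView are hypothetical names, and how the marker would be written and cleaned up is left open; this only shows that the reader's view no longer depends on the order in which the deletions complete.

```java
import java.util.Set;
import java.util.TreeSet;

// Hypothetical reader-side helper; not an existing Hudi API.
final class PendingArchiveFilter {

    // Returns the timeline a reader should use: every instant found on storage
    // minus the instants that the (assumed) marker lists as being deleted.
    static TreeSet<Integer> readerView(Set<Integer> instantsOnStorage,
                                       Set<Integer> pendingDeletion) {
        TreeSet<Integer> view = new TreeSet<>(instantsOnStorage);
        view.removeAll(pendingDeletion);
        return view;
    }

    public static void main(String[] args) {
        Set<Integer> pending = Set.of(1, 2, 3, 4);

        // Deletion partially done (1 and 3 gone) versus not started at all:
        // both readers end up with the same view containing only instant 5.
        System.out.println(readerView(Set.of(2, 4, 5), pending));
        System.out.println(readerView(Set.of(1, 2, 3, 4, 5), pending));
    }
}
```

With the marker in place, instants 1 through 4 are all treated as archived from the moment archiving starts, so a reader never sees the partial state with instants 2 and 4 only.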
--
This message was sent by Atlassian Jira
(v8.20.10#820010)