[jira] [Created] (HUDI-7207) Concurrent archiving and data reading leads to missing data in query results.

Ma Jian (Jira) Sun, 10 Dec 2023 22:13:04 -0800

Ma Jian created HUDI-7207:
-----------------------------

             Summary: Concurrent archiving and data reading leads to missing 
data in query results.
                 Key: HUDI-7207
                 URL: https://issues.apache.org/jira/browse/HUDI-7207
             Project: Apache Hudi
          Issue Type: Bug
            Reporter: Ma Jian



Assume a Hudi table has 4 instants that need to be archived, with timestamps 
in ascending order (they are already sorted once {{instantToArchive}} has been 
obtained): 1.deltacommit, 2.deltacommit, 3.deltacommit, and 4.deltacommit, 
corresponding to the files a.parquet, b.parquet, c.parquet, and d.parquet, 
respectively.

In the archiving code, the deletion of instants is handled by the following 
code snippet:
{code:java}
if (!completedInstants.isEmpty()) {
    context.foreach(
        completedInstants,
        instant -> activeTimeline.deleteInstantFileIfExists(instant),
        Math.min(completedInstants.size(), config.getArchiveDeleteParallelism())
    );
}
{code}
The instants are distributed across parallel tasks for execution. For 
instance, in Spark with a parallelism of 2, instants 1 and 2 may go to one task 
and instants 3 and 4 to another. Consequently, instant 3 may be deleted before 
instant 2. If instants 1 and 3 have been deleted while 2 and 4 have not, a 
query that obtains visibleCommitsAndCompactionTimeline at this point would see 
a timeline containing instants 2, 4, and so on.

During a query, this would make the data under c.parquet, which corresponds to 
instant 3, completely invisible. I believe this is a serious problem, as users 
could unknowingly retrieve incorrect data.
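
To make the scenario above concrete, here is a minimal, deterministic sketch. 
It does not use any Hudi classes. The class TimelineGapDemo and the method 
isVisible are hypothetical, and the visibility rule (a commit is visible if it 
is on the active timeline or older than the timeline's first instant) is a 
simplifying assumption made for illustration, not a statement of how the Hudi 
reader is implemented.

```java
import java.util.List;
import java.util.TreeSet;

// Toy model only: class and method names are hypothetical, not Hudi APIs.
class TimelineGapDemo {

    // Assumed, simplified visibility rule: a commit is visible if it is still
    // on the active timeline, or if it is older than the first active instant
    // (that is, it is treated as already archived).
    static boolean isVisible(TreeSet<String> activeTimeline, String commitTime) {
        if (activeTimeline.isEmpty()) {
            return true; // in this toy model, everything counts as archived
        }
        return activeTimeline.contains(commitTime)
            || commitTime.compareTo(activeTimeline.first()) < 0;
    }

    public static void main(String[] args) {
        TreeSet<String> timeline = new TreeSet<>(List.of("1", "2", "3", "4"));

        // Two parallel tasks own {1, 2} and {3, 4}. Each task has deleted only
        // its first instant when the reader takes its snapshot.
        timeline.remove("1");
        timeline.remove("3");

        for (String commitTime : List.of("1", "2", "3", "4")) {
            System.out.println(commitTime + " visible=" + isVisible(timeline, commitTime));
        }
    }
}
```

Under these assumptions the reader's snapshot is [2, 4], and only instant 3 is 
reported as not visible: it is missing from the timeline but is not older than 
the timeline's first instant.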

Here are a few potential solutions I've considered:

1. Prohibit concurrent deletion of completed instant files. While this would 
preserve the order of deletions, it could significantly impact performance, so 
it is not an optimal solution.

2. Implement a solution similar to a marker file that records which instants 
are in the process of being deleted, and then exclude these instants from the 
timeline during reads.

3. Building on the second solution, represent archiving on the timeline by 
adding archive instants, so that pending archives can be retrieved directly 
during data reads. Here I have a question: why does archiving not already have 
a corresponding instant action?
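
As a rough illustration of the second option, the sketch below filters a 
reader's view of the timeline using a set of instants recorded as "being 
archived". Everything here is hypothetical: PendingArchiveFilter, readerView, 
the pendingArchive set (standing in for the marker), the extra instant 5, and 
the simplified visibility rule are assumptions for illustration, not existing 
Hudi APIs or behavior.

```java
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Toy model only: class and method names are hypothetical, not Hudi APIs.
class PendingArchiveFilter {

    // Reader-side filter: drop every instant that the (hypothetical) marker
    // lists as being archived, whether or not its file still exists.
    static TreeSet<String> readerView(Set<String> instantsOnStorage, Set<String> pendingArchive) {
        TreeSet<String> view = new TreeSet<>(instantsOnStorage);
        view.removeAll(pendingArchive);
        return view;
    }

    // Assumed, simplified visibility rule: a commit is visible if it is on the
    // reader's timeline or older than the first instant on that timeline.
    static boolean isVisible(TreeSet<String> timeline, String commitTime) {
        if (timeline.isEmpty()) {
            return true; // in this toy model, everything counts as archived
        }
        return timeline.contains(commitTime)
            || commitTime.compareTo(timeline.first()) < 0;
    }

    public static void main(String[] args) {
        // Instants 1 and 3 are already deleted; 2 and 4 are still on storage.
        // Instant 5 is an extra instant that is not being archived.
        Set<String> onStorage = Set.of("2", "4", "5");
        Set<String> pendingArchive = Set.of("1", "2", "3", "4");

        TreeSet<String> view = readerView(onStorage, pendingArchive);
        System.out.println("reader timeline = " + view);
        for (String commitTime : List.of("1", "2", "3", "4", "5")) {
            System.out.println(commitTime + " visible=" + isVisible(view, commitTime));
        }
    }
}
```

In this model the reader's timeline no longer depends on the order in which 
the individual files are deleted: instants 1 through 4 are all treated as 
archived, so instant 3 is no longer singled out as missing.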



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
