[ 
https://issues.apache.org/jira/browse/HUDI-7207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ma Jian updated HUDI-7207:
--------------------------
    Priority: Blocker  (was: Major)

> Concurrent archiving and data reading leads to missing data in query results.
> -----------------------------------------------------------------------------
>
>                 Key: HUDI-7207
>                 URL: https://issues.apache.org/jira/browse/HUDI-7207
>             Project: Apache Hudi
>          Issue Type: Bug
>            Reporter: Ma Jian
>            Priority: Blocker
>
> Assuming there are 4 instants in a Hudi table that need to be archived, with 
> timestamps in ascending order (as they have been sorted after obtaining 
> {{instantToArchive}}): these are 1.deltacommit, 2.deltacommit, 
> 3.deltacommit, and 4.deltacommit, corresponding to the files a.parquet, 
> b.parquet, c.parquet, and d.parquet, respectively.
> In the archiving code, the deletion of instants is handled by the following 
> code snippet:
> {code:java}
> if (!completedInstants.isEmpty()) {
>     context.foreach(
>         completedInstants,
>         instant -> activeTimeline.deleteInstantFileIfExists(instant),
>         Math.min(completedInstants.size(), config.getArchiveDeleteParallelism())
>     );
> }
>  {code}
> Different instants are distributed across different threads for execution. 
> For instance, in Spark with a parallelism of 2, the instants could be split 
> into two groups, (1, 2) and (3, 4), that are deleted concurrently. 
> Consequently, instant 3 may be deleted before instant 2. If instants 1 and 3 
> have been deleted while 2 and 4 have not, a query that obtains 
> visibleCommitsAndCompactionTimeline at this point would see a timeline 
> containing instants 2, 4, and so on.
> During such a query, the data in c.parquet, which corresponds to instant 3, 
> becomes completely invisible. I believe this is a serious problem, as users 
> could unknowingly retrieve incorrect data.
> Here are a few potential solutions I've considered:
> 1. Prohibit concurrent deletion of completed instant files. While this would 
> ensure the order of deletions, it could significantly impact performance, so 
> it is not an ideal solution.
> 2. Implement a solution similar to a marker file, recording which instants are 
> in the process of being deleted, and then remove these instants directly from 
> the timeline during reads.
> 3. Building on the second solution, represent archiving itself on the 
> timeline by adding archive instants, so that pending archives can be 
> retrieved directly during data reads. Here I have a question: why doesn't 
> archiving already have a corresponding instant action?
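
The scenario and the second proposal above can be illustrated with a small, self-contained model. This is only a sketch under stated assumptions, not Hudi code: the class ArchiveRaceSketch, the methods isVisible and readerView, and the pendingDeletion set are all hypothetical, and the visibility rule is a deliberate simplification (an instant's data is readable if the instant is still in the active timeline or is older than the earliest active instant). An extra instant 5 is added so that the timeline is not empty after archiving.

```java
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

public class ArchiveRaceSketch {

    // Simplified, ASSUMED visibility rule (not Hudi's actual implementation):
    // data written by an instant is readable if the instant is still in the
    // active timeline, or if it is older than the earliest active instant
    // (that is, it is treated as already archived).
    static boolean isVisible(int instant, TreeSet<Integer> activeTimeline) {
        if (activeTimeline.isEmpty()) {
            return true;
        }
        return activeTimeline.contains(instant) || instant < activeTimeline.first();
    }

    // Reader-side filter for proposal 2: drop every instant listed in the
    // (hypothetical) pending-deletion marker before using the timeline.
    static TreeSet<Integer> readerView(Set<Integer> onStorage, Set<Integer> pendingDeletion) {
        TreeSet<Integer> view = new TreeSet<>(onStorage);
        view.removeAll(pendingDeletion);
        return view;
    }

    public static void main(String[] args) {
        // The archiver intends to delete instants 1..4; instant 5 stays active.
        Set<Integer> pendingDeletion = Set.of(1, 2, 3, 4);
        // Mid-archive state from the description: 1 and 3 deleted, 2 and 4 not yet.
        Set<Integer> onStorage = Set.of(2, 4, 5);

        TreeSet<Integer> raw = new TreeSet<>(onStorage);
        TreeSet<Integer> filtered = readerView(onStorage, pendingDeletion);

        for (int instant : List.of(1, 2, 3, 4, 5)) {
            System.out.printf("instant %d: raw=%b filtered=%b%n",
                instant, isVisible(instant, raw), isVisible(instant, filtered));
        }
    }
}
```

In this model, a reader using the raw listing (2, 4, 5) treats instant 3 as invisible, which matches the missing c.parquet data described in the issue. A reader that first removes the pending-deletion instants sees only instant 5, so instants 1 through 4 are all treated consistently as archived, regardless of the order in which their files are deleted.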



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
