[
https://issues.apache.org/jira/browse/HUDI-7207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Ma Jian updated HUDI-7207:
--------------------------
Priority: Blocker (was: Major)
> Concurrent archiving and data reading leads to missing data in query results.
> -----------------------------------------------------------------------------
>
> Key: HUDI-7207
> URL: https://issues.apache.org/jira/browse/HUDI-7207
> Project: Apache Hudi
> Issue Type: Bug
> Reporter: Ma Jian
> Priority: Blocker
>
> Assume a Hudi table has four instants that need to be archived. Their
> timestamps are in ascending order (they are sorted after
> {{instantToArchive}} is obtained): 1.deltacommit, 2.deltacommit,
> 3.deltacommit, and 4.deltacommit, corresponding to the files a.parquet,
> b.parquet, c.parquet, and d.parquet, respectively.
> In the archiving code, the deletion of instants is handled by the following
> code snippet:
> {code:java}
> if (!completedInstants.isEmpty()) {
>   context.foreach(
>       completedInstants,
>       instant -> activeTimeline.deleteInstantFileIfExists(instant),
>       Math.min(completedInstants.size(), config.getArchiveDeleteParallelism())
>   );
> }
> {code}
> The instants are distributed across different threads for execution.
> For instance, in Spark with a parallelism of 2, they could be split into one
> group holding 1 and 2 and another holding 3 and 4. Consequently, instant 3
> may be deleted before instant 2. If instants 1 and 3 have been deleted while
> 2 and 4 have not, a query that obtains visibleCommitsAndCompactionTimeline
> at that moment would see a timeline containing instants 2, 4, and so on.
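To make the window concrete, here is a minimal sketch that enumerates every intermediate state a reader could observe when two workers delete their instants independently. It is not Hudi code: the class and method names are hypothetical, and it assumes each worker deletes its own instants in ascending order and that worker A holds only older instants than worker B.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative model only, not Hudi code: two workers, each deleting its own
// sorted slice of instants in order, with no coordination between them.
class DeletionInterleavings {

    /** Instants still present after worker A deleted doneA and worker B deleted doneB of theirs. */
    static List<Integer> remaining(List<Integer> workerA, List<Integer> workerB, int doneA, int doneB) {
        List<Integer> left = new ArrayList<>(workerA.subList(doneA, workerA.size()));
        left.addAll(workerB.subList(doneB, workerB.size()));
        return left;
    }

    /** True if only the oldest instants are missing, i.e. deletion looks strictly ordered so far. */
    static boolean isContiguousSuffix(List<Integer> all, List<Integer> left) {
        return all.subList(all.size() - left.size(), all.size()).equals(left);
    }

    /** Observable states in which a newer instant is gone while an older one is still present. */
    static List<List<Integer>> inconsistentStates(List<Integer> workerA, List<Integer> workerB) {
        List<Integer> all = new ArrayList<>(workerA);
        all.addAll(workerB);
        List<List<Integer>> bad = new ArrayList<>();
        for (int doneA = 0; doneA <= workerA.size(); doneA++) {
            for (int doneB = 0; doneB <= workerB.size(); doneB++) {
                List<Integer> left = remaining(workerA, workerB, doneA, doneB);
                if (!isContiguousSuffix(all, left)) {
                    bad.add(left);
                }
            }
        }
        return bad;
    }

    public static void main(String[] args) {
        // The split from the description: instants 1 and 2 on one worker, 3 and 4 on the other.
        System.out.println(inconsistentStates(Arrays.asList(1, 2), Arrays.asList(3, 4)));
    }
}
```

Under these assumptions, four of the nine observable states are out of order, including the timeline with only instants 2 and 4 described above.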
> During a query, the data in c.parquet, which corresponds to instant 3,
> would then be completely invisible. I believe this is a serious problem,
> because users could unknowingly retrieve incorrect results.
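The effect on a reader can be sketched as follows. This is a simplified model, not the actual Hudi read path: the class and method names are hypothetical, and it assumes the reader treats any instant older than the first active instant as archived (and therefore committed), while an instant that falls inside the active range but is missing from it is treated as not committed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

// Simplified, hypothetical model of how a reader decides which files are visible.
class ReaderVisibilitySketch {

    /** The four deltacommits from the description and the file each one wrote. */
    static TreeMap<String, String> filesByInstant() {
        TreeMap<String, String> files = new TreeMap<>();
        files.put("1", "a.parquet");
        files.put("2", "b.parquet");
        files.put("3", "c.parquet");
        files.put("4", "d.parquet");
        return files;
    }

    /** Files whose instant is either on the active timeline or older than its first instant. */
    static List<String> visibleFiles(Map<String, String> filesByInstant, TreeSet<String> activeInstants) {
        List<String> visible = new ArrayList<>();
        for (Map.Entry<String, String> entry : filesByInstant.entrySet()) {
            String instant = entry.getKey();
            // Assumption: instants before the start of the active timeline count as archived.
            boolean archived = activeInstants.isEmpty()
                    || instant.compareTo(activeInstants.first()) < 0;
            if (archived || activeInstants.contains(instant)) {
                visible.add(entry.getValue());
            }
        }
        return visible;
    }

    public static void main(String[] args) {
        TreeMap<String, String> files = filesByInstant();
        TreeSet<String> active = new TreeSet<>(files.keySet());

        // Snapshot taken mid-archive: instants 1 and 3 are gone, 2 and 4 remain.
        active.remove("1");
        active.remove("3");

        // Prints [a.parquet, b.parquet, d.parquet]: c.parquet has dropped out.
        System.out.println(visibleFiles(files, active));
    }
}
```

In this model instant 1 is harmless because it precedes the start of the active timeline, whereas instant 3 sits between two active instants and its file is silently skipped.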
> Here are a few potential solutions I've considered:
> 1. Prohibit concurrent deletion of completed instant files. This would ensure
> the order of deletions, but it could significantly impact performance, so it
> is not an optimal solution.
> 2. Implement something similar to a marker file that records which instants
> are in the process of being deleted, and exclude those instants from the
> timeline during reads.
> 3. Building on the second solution, model archiving as an action by adding
> archive instants to the timeline, so that pending archives can be retrieved
> directly during data reads. A related question: why has archiving not had a
> corresponding instant action so far?
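As a rough illustration of the second option, the sketch below filters marked instants out of whatever is physically present before the reader builds its view. Everything here is hypothetical: the marker is modeled as an in-memory set rather than a file, none of the names are Hudi APIs, and instant 5 is added only to stand for a newer instant that is not being archived.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch of option 2: the archiver records the instants it is about
// to delete, and readers drop those instants from the timeline they load.
class PendingArchiveMarkerSketch {

    /** Timeline a reader would use: instants found on storage minus those marked for deletion. */
    static TreeSet<String> readerTimeline(Set<String> instantsOnStorage, Set<String> markedForDeletion) {
        TreeSet<String> effective = new TreeSet<>(instantsOnStorage);
        effective.removeAll(markedForDeletion);
        return effective;
    }

    public static void main(String[] args) {
        // Marker content, assumed to be written before any instant file is deleted.
        Set<String> marker = new TreeSet<>(Arrays.asList("1", "2", "3", "4"));

        // Two snapshots of storage: before deletion starts and partway through it.
        Set<String> beforeDeletion = new TreeSet<>(Arrays.asList("1", "2", "3", "4", "5"));
        Set<String> midDeletion = new TreeSet<>(Arrays.asList("2", "4", "5"));

        // Both print [5]: the reader's view no longer depends on deletion order.
        System.out.println(readerTimeline(beforeDeletion, marker));
        System.out.println(readerTimeline(midDeletion, marker));
    }
}
```

With the marked instants removed up front, instants 1 through 4 all fall before the start of the reader's timeline regardless of how far the parallel deletion has progressed, so the deletes themselves can stay concurrent. How the marker would be written, cleaned up, and recovered after a failed archive run is left open here.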
--
This message was sent by Atlassian Jira
(v8.20.10#820010)