[PR] [HUDI-7207] During archiving, complete instants are deleted serially to prevent data errors during data reads. [hudi]

via GitHub Wed, 13 Dec 2023 18:54:18 -0800


majian1998 opened a new pull request, #10325:
URL: https://github.com/apache/hudi/pull/10325


   This PR is mainly intended to open a discussion on how to fix the issue 
described below.
   
   Assume a Hudi table has 4 instants that need to be archived, with 
timestamps in ascending order (they are sorted after instantToArchive is 
obtained): 1.deltacommit, 2.deltacommit, 3.deltacommit, and 4.deltacommit, 
corresponding to the files a.parquet, b.parquet, c.parquet, and d.parquet, 
respectively.
   
   In the archiving code, the deletion of instants is handled by the following 
code snippet:
   
   ```
   if (!completedInstants.isEmpty()) {
       context.foreach(
           completedInstants,
           instant -> activeTimeline.deleteInstantFileIfExists(instant),
           Math.min(completedInstants.size(), config.getArchiveDeleteParallelism())
       );
   }
   ```
    
   
   Different instants are distributed across different threads for execution. 
For instance, in Spark with a parallelism of 2, instants 1 and 2 might go to 
one task and instants 3 and 4 to another. Consequently, instant 3 may be 
deleted before instant 2. If instants 1 and 3 have been deleted while 2 and 4 
have not, a query that obtains visibleCommitsAndCompactionTimeline at this 
point would see a timeline containing instants 2, 4, and any later ones.
   
   During a query, the data in c.parquet, which corresponds to instant 3, 
would become completely invisible, since instant 3 is no longer in the 
timeline. I believe this is a serious problem, as users could unknowingly 
retrieve incorrect data.
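
The gap can be illustrated with a small, self-contained model. This is only a sketch under a stated assumption: a reader treats an instant as committed if it is still in the active timeline or is older than the earliest active instant. The class and method names are hypothetical and are not Hudi APIs.

```java
import java.util.List;
import java.util.TreeSet;

// Simplified model of the active timeline; every name here is hypothetical.
class TimelineGapSketch {

    // Assumption: an instant is visible if it is still in the active timeline,
    // or if it is older than the earliest active instant (treated as archived).
    static boolean isVisible(TreeSet<Integer> activeTimeline, int instant) {
        if (activeTimeline.contains(instant)) {
            return true;
        }
        return !activeTimeline.isEmpty() && instant < activeTimeline.first();
    }

    // Start from instants 1..4 and remove the ones whose files were already deleted.
    static TreeSet<Integer> timelineAfterDeleting(List<Integer> deleted) {
        TreeSet<Integer> timeline = new TreeSet<>(List.of(1, 2, 3, 4));
        timeline.removeAll(deleted);
        return timeline;
    }

    public static void main(String[] args) {
        // Parallel deletion removed 1 and 3 first: the timeline is {2, 4}.
        TreeSet<Integer> outOfOrder = timelineAfterDeleting(List.of(1, 3));
        System.out.println("out of order, instant 3 visible: " + isVisible(outOfOrder, 3)); // false

        // Serial deletion removed 1 and 2 first: the timeline is {3, 4}.
        TreeSet<Integer> inOrder = timelineAfterDeleting(List.of(1, 2));
        System.out.println("in order, instant 3 visible: " + isVisible(inOrder, 3)); // true
    }
}
```

In this model, deleting oldest-first always leaves a contiguous suffix of the timeline, so no instant ends up inside the timeline's range while missing from it.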
   
   Here are a few potential solutions I've considered:
   
   1. Prohibit concurrent deletion of completed instant files. This guarantees 
the order of deletions. Deleting instants serially may be slower than deleting 
them in parallel, so it is not an optimal solution, but since there are usually 
few instants to remove, it is an acceptable stopgap. The change can be reverted 
when a better solution is found.
   
   2. Implement a mechanism similar to marker files that records which instants 
are in the process of being deleted, and then exclude these instants from the 
timeline during reads.
   
   3. Building on the second solution, treat archiving as an action by adding 
archive instants to the timeline, so that pending archives can be retrieved 
directly during data reads. I have a question here: why does archiving not 
already have a corresponding instant action?
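
The first option can be sketched roughly as follows. This is a minimal, self-contained illustration rather than the actual change in this PR: `deleteSerially` is a hypothetical helper, and the `deleter` callback stands in for `activeTimeline.deleteInstantFileIfExists(instant)` from the snippet above.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical helper showing serial, oldest-first deletion of completed instants.
class SerialDeleteSketch {

    // Sorts a copy of the instants in ascending order and deletes them one at a
    // time on the calling thread, so an older instant is always removed first.
    static <T extends Comparable<T>> void deleteSerially(List<T> completedInstants,
                                                         Consumer<T> deleter) {
        List<T> ordered = new ArrayList<>(completedInstants);
        ordered.sort(Comparator.naturalOrder());
        for (T instant : ordered) {
            deleter.accept(instant);
        }
    }

    public static void main(String[] args) {
        List<String> deletionOrder = new ArrayList<>();
        deleteSerially(
            List.of("3.deltacommit", "1.deltacommit", "4.deltacommit", "2.deltacommit"),
            deletionOrder::add);
        // Prints the instants in ascending order, regardless of input order.
        System.out.println(deletionOrder);
    }
}
```

The trade-off is the one described in option 1: the loop gives up the parallelism of `context.foreach` in exchange for a deterministic deletion order.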
   
   ### Change Logs
   
   Delete completed instants serially.
   
   ### Impact
   
   None
   
   ### Risk level (write none, low medium or high below)
   
   Low
   
   ### Documentation Update
   
   None
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
