danny0405 opened a new issue #5020:
URL: https://github.com/apache/hudi/issues/5020


   Currently we have several cleaning strategies, such as `num_commits`, `delta hours`, and `num_versions`.
   Let's say a user uses the `num_commits` strategy
   
   with the params:
   
   - max 10 commits to archive
   - min 4 commits to keep alive
   - clean when the number of commits exceeds 6
   
   c1 ---- c2 ---- c3 ---- c4 ---- c5 ---- c6 ---- c7---- c8 ---- c9 ---- c10
   
   At c10, the reader starts reading the latest fs view, which includes a file slice that was written in c1:
   
   /+
     --- fg1_c1.parquet
   
   The cleaner also starts working at c10. It finds that the number of commits exceeds 6 (10 > 6), so all the files committed in c1 ~ c4 are deleted, and the reader throws `FileNotFoundException`.
   
   This problem is common and occurs frequently, especially in streaming read mode (it also happens when a batch read job is complex and runs for a long time).
   
   We need some mechanisms to ensure the semantic integrity of the read view.
   
   
   

