danny0405 opened a new issue #5020:
URL: https://github.com/apache/hudi/issues/5020
Currently we have several cleaning strategies, such as `num_commits`, `delta hours`, and `num_versions`.
Let's say a user picks the `num_commits` strategy with the following params:
- max 10 commits to archive
- min 4 commits to keep alive
- 6 commits to clean (i.e., retain only the latest 6 commits)
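Assuming the standard cleaner/archival config keys (the mapping below is my reading of the params above, not taken from the issue itself), this setup roughly corresponds to:

```properties
# archival: keep between 4 and 10 commits in the active timeline
hoodie.keep.max.commits=10
hoodie.keep.min.commits=4
# cleaning: retain file slices for the latest 6 commits
hoodie.cleaner.commits.retained=6
```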
c1 ---- c2 ---- c3 ---- c4 ---- c5 ---- c6 ---- c7---- c8 ---- c9 ---- c10
At c10, a reader starts reading the latest filesystem view, which contains a file slice that was written back in c1:

/
+--- fg1_c1.parquet
Meanwhile, the cleaner also kicks in at c10. It finds that the number of commits exceeds 6 (10 > 6), so all the files committed in c1 ~ c4 are deleted. The reader then throws a `FileNotFoundException`.
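The failure can be reproduced in miniature. In the sketch below (plain Python, with a dict standing in for the filesystem; all names are illustrative, not Hudi APIs), a reader resolves its view at c10 while a cleaner applying the `num_commits` policy deletes the file slices from c1 ~ c4 underneath it:

```python
commits = [f"c{i}" for i in range(1, 11)]             # active timeline c1..c10
fs = {f"fg1_{c}.parquet": b"data" for c in commits}   # one file slice per commit

def commits_to_clean(timeline, commits_retained=6):
    # num_commits policy: keep the latest `commits_retained` commits,
    # everything older is eligible for cleaning
    if len(timeline) <= commits_retained:
        return []
    return timeline[:len(timeline) - commits_retained]

# 1. Reader resolves the latest view at c10; its plan still references c1's slice.
reader_plan = ["fg1_c1.parquet"]

# 2. Cleaner runs concurrently: 10 commits > 6 retained, so c1..c4 are cleaned.
for c in commits_to_clean(commits):
    del fs[f"fg1_{c}.parquet"]

# 3. Reader executes its plan and hits the missing file.
try:
    data = [fs[p] for p in reader_plan]
except KeyError as e:  # analogous to FileNotFoundException on a real filesystem
    print("reader failed:", e)
```

The reader's plan was valid when it was built; the view only becomes inconsistent because cleaning happens between plan time and read time.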
This problem is common and occurs frequently, especially in streaming read mode (it also happens when a batch read job is complex and runs for a long time). We need some mechanism to ensure the semantic integrity of the read view.
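One conceivable mechanism (a sketch only, not an existing Hudi API): let each in-flight reader register the earliest instant its view depends on, and cap the cleaner's boundary at the earliest active lease. Instants are modeled as plain integers here for simplicity:

```python
class ReaderLeaseRegistry:
    """Tracks the earliest instant each in-flight reader still depends on.

    Illustrative sketch only: a real implementation would need to persist
    leases and expire them when readers crash.
    """
    def __init__(self):
        self._active = set()

    def acquire(self, instant):
        # Reader announces it depends on file slices as old as `instant`.
        self._active.add(instant)
        return instant

    def release(self, instant):
        # Reader is done; its lease no longer constrains the cleaner.
        self._active.discard(instant)

    def earliest(self):
        return min(self._active, default=None)

def safe_clean_boundary(policy_boundary, registry):
    # The policy says "clean everything before `policy_boundary`", but we
    # never clean past the oldest instant an active reader still needs.
    earliest = registry.earliest()
    if earliest is None:
        return policy_boundary
    return min(policy_boundary, earliest)
```

In the scenario above, the policy boundary at c10 would be c5 (retain the latest 6 commits), but a reader holding a lease on c1 would pull the effective boundary back to c1, so the cleaner would leave fg1_c1.parquet in place until the read finishes.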
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]