majian1998 opened a new pull request, #9243: URL: https://github.com/apache/hudi/pull/9243
… when the earliest ActiveTimeline is a pending commit.

### Change Logs

Modify `getEarliestCommitToRetain` in `CleanPlanner` to handle the case where the first commit on the active timeline is a pending commit during a clean operation, so that incremental clean can run normally.

### Impact

None

### Risk level (write none, low medium or high below)

None

### Documentation Update

When performing a clean, the earliest commit to retain, as returned by `getEarliestCommitToRetain` in `CleanPlanner`, is used as the endpoint of the clean. However, when a pending commit takes a long time and all commits earlier than it have been archived, the pending commit becomes the earliest instant on the active timeline. In this situation, `getEarliestCommitToRetain` returns empty because there is no commit earlier than the pending commit.

An incremental clean uses the previous endpoint, that is, the commit retained by the previous clean, as its starting point. If this starting point is empty, a full clean is triggered, which is very resource-intensive.

To solve this problem without affecting normal clean, I set the earliest commit to retain in this case to the earliest pending commit. Since the endpoint itself is not cleaned in the current clean, this approach fixes the problem without changing the behavior of a normal clean.

### Contributor's checklist

- [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
- [ ] Change Logs and Impact were stated clearly
- [ ] Adequate tests were added if applicable
- [ ] CI passed
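
The fallback described above can be illustrated with a minimal sketch. This is not the actual `CleanPlanner` code: `TimelineInstant`, `beforeFix`, and `afterFix` are hypothetical names, the timeline is assumed to be a list sorted by timestamp, and the retention-count logic of the real cleaner is omitted. Only the "empty result versus earliest pending commit" behavior is modeled.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

// Illustrative model only; none of these types are Hudi classes.
final class EarliestCommitSketch {

    // Hypothetical stand-in for an instant on the active timeline.
    static final class TimelineInstant {
        final String timestamp;
        final boolean completed;

        TimelineInstant(String timestamp, boolean completed) {
            this.timestamp = timestamp;
            this.completed = completed;
        }
    }

    // Earliest pending instant on a timestamp-ordered active timeline, if any.
    static Optional<TimelineInstant> earliestPending(List<TimelineInstant> timeline) {
        return timeline.stream().filter(i -> !i.completed).findFirst();
    }

    // Modeled old behavior: only a completed commit earlier than the earliest
    // pending commit can be returned, so the result is empty when none remains.
    static Optional<String> beforeFix(List<TimelineInstant> timeline) {
        Optional<TimelineInstant> pending = earliestPending(timeline);
        return timeline.stream()
            .filter(i -> i.completed)
            .filter(i -> !pending.isPresent()
                || i.timestamp.compareTo(pending.get().timestamp) < 0)
            .map(i -> i.timestamp)
            .findFirst();
    }

    // Modeled new behavior: fall back to the earliest pending commit instead of
    // returning empty, so the next incremental clean has a starting point.
    static Optional<String> afterFix(List<TimelineInstant> timeline) {
        Optional<String> result = beforeFix(timeline);
        if (result.isPresent()) {
            return result;
        }
        return earliestPending(timeline).map(i -> i.timestamp);
    }

    public static void main(String[] args) {
        // Everything before the pending commit "003" has been archived.
        List<TimelineInstant> timeline = Arrays.asList(
            new TimelineInstant("003", false),
            new TimelineInstant("004", true));
        System.out.println(beforeFix(timeline)); // prints Optional.empty
        System.out.println(afterFix(timeline));  // prints Optional[003]
    }
}
```

When a completed commit still exists before the pending one, both methods in this sketch return the same value, which mirrors the claim that a normal clean is unaffected.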
