majian1998 opened a new pull request, #9243:
URL: https://github.com/apache/hudi/pull/9243

   … when the earliest ActiveTimeline is a pending commit.
   
   ### Change Logs
   
   Modify getEarliestCommitToRetain in CleanPlanner to handle the case where the 
earliest instant on the active timeline is a pending commit during a clean 
operation, so that incremental clean can still run normally.
   
   ### Impact
   
   None
   
   ### Risk level (write none, low medium or high below)
   
   None
   
   ### Documentation Update
   
   When performing a clean, the earliest commit to retain, as returned by the 
getEarliestCommitToRetain method in CleanPlanner, is used as the endpoint of the 
clean. However, when a pending commit takes a long time and all commits earlier 
than it have been archived, the pending commit becomes the earliest instant on 
the active timeline. In this situation, getEarliestCommitToRetain returns empty 
because no commit is earlier than the pending commit. An incremental clean uses 
the previous endpoint, that is, the commit recorded as the earliest to retain by 
the previous clean, as its starting point. If this starting point is empty, a 
full clean is triggered, which is very resource-intensive.
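
   The cost difference described above can be illustrated with a minimal 
sketch. It is not Hudi code: `IncrementalCleanSketch`, `partitionsToScan`, and 
the commit-to-partitions map are hypothetical names introduced here, and the 
inclusive starting point is an assumption. It only shows why an empty starting 
point forces every partition to be scanned.

```java
import java.util.List;
import java.util.Optional;
import java.util.Set;
import java.util.SortedMap;
import java.util.TreeSet;

// Hypothetical, simplified model; not Hudi's CleanPlanner.
final class IncrementalCleanSketch {

    /**
     * Chooses which partitions a clean has to scan.
     *
     * @param previousEarliestRetained endpoint recorded by the previous clean (may be empty)
     * @param partitionsByCommit       commit timestamp -> partitions written by that commit
     * @param allPartitions            every partition in the table
     */
    static Set<String> partitionsToScan(Optional<String> previousEarliestRetained,
                                        SortedMap<String, List<String>> partitionsByCommit,
                                        Set<String> allPartitions) {
        if (!previousEarliestRetained.isPresent()) {
            // No starting point: fall back to a full clean over all partitions.
            return new TreeSet<>(allPartitions);
        }
        // Incremental clean: only partitions touched at or after the starting point
        // (tailMap is inclusive of its key; inclusivity is an assumption of this sketch).
        Set<String> touched = new TreeSet<>();
        for (List<String> partitions
                : partitionsByCommit.tailMap(previousEarliestRetained.get()).values()) {
            touched.addAll(partitions);
        }
        return touched;
    }
}
```

   With an empty starting point the method returns every partition, which is the 
expensive path this PR tries to avoid.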
   To solve this, I set the earliest commit to retain in this case to the 
earliest pending commit. Because the endpoint itself is never cleaned in the 
current clean, this avoids the full clean without affecting normal clean 
behavior.
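
   The fallback can be sketched roughly as follows. This is a simplified model 
and not the actual patch: `EarliestCommitSketch`, its nested `Instant` class, 
and the `commitsRetained` parameter are hypothetical stand-ins, and timestamps 
are assumed to be fixed-width strings so that string order matches time order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

// Hypothetical, simplified model; not Hudi's CleanPlanner or HoodieInstant.
final class EarliestCommitSketch {

    static final class Instant {
        final String timestamp;   // assumed fixed-width, so string order == time order
        final boolean completed;  // false means the commit is still pending

        Instant(String timestamp, boolean completed) {
            this.timestamp = timestamp;
            this.completed = completed;
        }
    }

    static Optional<Instant> earliestCommitToRetain(List<Instant> activeTimeline,
                                                    int commitsRetained) {
        List<Instant> sorted = new ArrayList<>(activeTimeline);
        sorted.sort(Comparator.comparing((Instant i) -> i.timestamp));

        List<Instant> completed = new ArrayList<>();
        Optional<Instant> earliestPending = Optional.empty();
        for (Instant i : sorted) {
            if (i.completed) {
                completed.add(i);
            } else if (!earliestPending.isPresent()) {
                earliestPending = Optional.of(i);
            }
        }

        // Too few completed commits: nothing is eligible for cleaning.
        if (completed.size() <= commitsRetained) {
            return Optional.empty();
        }
        Instant nth = completed.get(completed.size() - commitsRetained);

        // The retained commit must not be later than the earliest pending commit.
        if (earliestPending.isPresent()
                && earliestPending.get().timestamp.compareTo(nth.timestamp) <= 0) {
            Instant lastBeforePending = null;
            for (Instant i : completed) {
                if (i.timestamp.compareTo(earliestPending.get().timestamp) < 0) {
                    lastBeforePending = i;
                }
            }
            // Fallback described in this PR: if no completed commit precedes the
            // pending one, return the pending commit instead of empty.
            return lastBeforePending != null ? Optional.of(lastBeforePending) : earliestPending;
        }
        return Optional.of(nth);
    }

    public static void main(String[] args) {
        List<Instant> timeline = Arrays.asList(
                new Instant("001", false),  // long-running pending commit, now the earliest instant
                new Instant("002", true),
                new Instant("003", true),
                new Instant("004", true));
        System.out.println(earliestCommitToRetain(timeline, 2).get().timestamp); // prints 001
    }
}
```

   Without the fallback, the first timeline in `main` would yield an empty 
result, and the next incremental clean would have no starting point.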
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   


