nsivabalan opened a new pull request, #18322:
URL: https://github.com/apache/hudi/pull/18322

   ### Describe the issue this Pull Request addresses
   
   When the cleaner is disabled for an extended period and the timeline grows
   large (e.g., 1000+ commits), resuming the cleaner can attempt to clean
   hundreds or thousands of commits' worth of file slices in a single
   operation. This can cause memory pressure during clean planning,
   long-running operations that may time out, OOM errors, and operational
   instability.
   
     This PR introduces a new configuration hoodie.clean.max.commits.to.clean 
to cap the maximum number of commits that can be cleaned in a
     single clean operation, allowing for gradual, incremental cleanup over 
multiple clean runs.
   
   ### Summary and Changelog
   
   Summary:
     Users can now configure hoodie.clean.max.commits.to.clean to limit the 
number of commits cleaned per operation. This prevents resource
     exhaustion when resuming cleaning after a long period of inactivity. The 
cleaner will incrementally catch up over multiple runs,
     cleaning up to the configured limit each time.
   
     Changelog:
   
     - feat(clean): Add hoodie.clean.max.commits.to.clean configuration to cap 
commits cleaned per operation
       - Added MAX_COMMITS_TO_CLEAN config in HoodieCleanConfig with default 
value Long.MAX_VALUE
       - Added getMaxCommitsToClean() accessor in HoodieWriteConfig
       - Added withMaxCommitsToClean() builder method in 
HoodieCleanConfig.Builder
     - core: Updated CleanerUtils.getEarliestCommitToRetain() to support capping
       - Extended method signature to accept previousEarliestCommitToRetain and 
maxCommitsToClean parameters
       - Added capCommitsToClean() helper method to adjust earliest commit when 
cap is exceeded
       - Logs when capping is applied with before/after commit counts
     - core: Updated CleanPlanner.getEarliestCommitToRetain() to retrieve 
previous clean metadata
       - Reads earliestCommitToRetain from last completed clean's metadata
       - Passes previous clean info and config to 
CleanerUtils.getEarliestCommitToRetain()
       - Gracefully handles missing previous clean metadata (no capping applied)
     - core: Updated ArchivalUtils.getEarliestCommitToRetain() to pass empty 
values for new parameters
       - Archival continues to work without capping (uses Option.empty() and 
Long.MAX_VALUE)
     - test: Added comprehensive unit tests in TestCleanerUtils
       - Tests for KEEP_LATEST_COMMITS policy with/without capping
       - Tests for KEEP_LATEST_BY_HOURS policy with/without capping
       - Tests for boundary conditions, missing previous clean, and default 
values
       - Added helper methods to create mock timelines with realistic timestamps
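
The capping step described in the changelog above can be sketched roughly as follows. This is a simplified illustration, not the code in this PR: the class name `CapCommitsSketch`, the helper `capEarliestCommitToRetain`, and the use of a plain sorted list of instant strings in place of a Hudi timeline are all assumptions made for the example. It assumes the commits to clean are those between the previous and the newly computed earliest commit to retain.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

class CapCommitsSketch {

    // Hypothetical helper (not the PR's actual capCommitsToClean()):
    // returns the earliest commit to retain, moved back toward the previous
    // clean's boundary when more than maxCommitsToClean commits would be cleaned.
    static String capEarliestCommitToRetain(List<String> sortedInstants,
                                            String earliestCommitToRetain,
                                            Optional<String> previousEarliestCommitToRetain,
                                            long maxCommitsToClean) {
        // No previous clean metadata or default config: no capping is applied.
        if (!previousEarliestCommitToRetain.isPresent() || maxCommitsToClean == Long.MAX_VALUE) {
            return earliestCommitToRetain;
        }
        int prevIdx = sortedInstants.indexOf(previousEarliestCommitToRetain.get());
        int newIdx = sortedInstants.indexOf(earliestCommitToRetain);
        if (prevIdx < 0 || newIdx < 0) {
            // Assumption: if either instant is not on the timeline, leave the value unchanged.
            return earliestCommitToRetain;
        }
        long commitsToClean = (long) newIdx - prevIdx;
        if (commitsToClean <= maxCommitsToClean) {
            return earliestCommitToRetain;
        }
        // Retain everything from the instant that is maxCommitsToClean past the previous boundary.
        return sortedInstants.get((int) (prevIdx + maxCommitsToClean));
    }

    public static void main(String[] args) {
        // Mock timeline of 1000 instants: 20000000000000 .. 20000000000999.
        List<String> instants = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            instants.add(String.format("20000000000%03d", i));
        }
        String capped = capEarliestCommitToRetain(
            instants, "20000000000988", Optional.of("20000000000000"), 50L);
        System.out.println(capped); // prints 20000000000050
    }
}
```

With a previous boundary at the start of the mock timeline and a cap of 50, the sketch moves the boundary from 20000000000988 to 20000000000050, which matches the shape of the log message shown under Monitoring.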
   
   ### Impact
   
    Configuration:
     - New advanced configuration: hoodie.clean.max.commits.to.clean (default: 
Long.MAX_VALUE)
     - Applicable to KEEP_LATEST_COMMITS and KEEP_LATEST_BY_HOURS cleaning 
policies
     - Fully backward compatible - existing behavior unchanged unless 
explicitly configured
   
     Behavior Change:
     - When hoodie.clean.max.commits.to.clean is set to a value < 
Long.MAX_VALUE:
       - The cleaner will limit the number of commits cleaned in each operation
       - Multiple clean runs may be needed to fully catch up when timeline has 
many commits to clean
       - Each clean operation will adjust earliestCommitToRetain based on the 
cap
   
     Performance Impact:
     - Positive: Prevents memory exhaustion and timeouts during large clean 
operations
     - Positive: Allows operators to control resource usage during cleanup
     - Neutral: Default behavior unchanged (no capping)
     - Trade-off: May require multiple clean runs to fully catch up 
(intentional for safety)
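
To make the trade-off concrete, the number of clean runs needed to work through a backlog can be estimated with a ceiling division. The class `CleanRunsEstimate` and the helper `runsToCatchUp` are hypothetical names used only for this sketch, and the estimate assumes no additional commits become eligible for cleaning while the cleaner is catching up.

```java
class CleanRunsEstimate {

    // Hypothetical helper: number of clean runs needed to clear a backlog
    // when each run cleans at most maxCommitsToClean commits.
    static long runsToCatchUp(long backlogCommits, long maxCommitsToClean) {
        if (maxCommitsToClean <= 0) {
            throw new IllegalArgumentException("maxCommitsToClean must be positive");
        }
        if (backlogCommits <= 0) {
            return 0;
        }
        // Ceiling division written to avoid overflow when the cap is Long.MAX_VALUE.
        return backlogCommits / maxCommitsToClean
            + (backlogCommits % maxCommitsToClean == 0 ? 0 : 1);
    }

    public static void main(String[] args) {
        System.out.println(runsToCatchUp(988, 50));             // prints 20
        System.out.println(runsToCatchUp(988, Long.MAX_VALUE)); // prints 1
    }
}
```

For a backlog of 988 commits and a cap of 50, roughly 20 clean runs would be needed; with the default cap the backlog is handled in a single run, as before.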
   
     Example Usage:
     # Cap cleaning to 50 commits per operation
     hoodie.clean.max.commits.to.clean=50
   
     Monitoring:
     - Log messages indicate when capping is applied:
     INFO CleanerUtils - Capping commits to clean from 988 to 50. Adjusted earliest commit to retain from 20000000000988 to 20000000000050
   
   ### Risk Level
   
   low
   
   ### Documentation Update
   
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Enough context is provided in the sections above
   - [ ] Adequate tests were added if applicable
   

