nsivabalan opened a new pull request, #18322:
URL: https://github.com/apache/hudi/pull/18322
### Describe the issue this Pull Request addresses
When the cleaner is disabled for an extended period and the timeline grows
large (e.g., 1000+ commits), resuming the cleaner can attempt
to clean hundreds or thousands of commits' worth of file slices in a
single operation. This can cause memory pressure during clean
planning, long-running operations that may time out, OOM errors, and
operational instability.
This PR introduces a new configuration hoodie.clean.max.commits.to.clean
to cap the maximum number of commits that can be cleaned in a
single clean operation, allowing for gradual, incremental cleanup over
multiple clean runs.
### Summary and Changelog
Summary:
Users can now configure hoodie.clean.max.commits.to.clean to limit the
number of commits cleaned per operation. This prevents resource
exhaustion when resuming cleaning after a long period of inactivity. The
cleaner will incrementally catch up over multiple runs,
cleaning up to the configured limit each time.
Changelog:
- feat(clean): Add hoodie.clean.max.commits.to.clean configuration to cap
  commits cleaned per operation
  - Added MAX_COMMITS_TO_CLEAN config in HoodieCleanConfig with default
    value Long.MAX_VALUE
  - Added getMaxCommitsToClean() accessor in HoodieWriteConfig
  - Added withMaxCommitsToClean() builder method in
    HoodieCleanConfig.Builder
- core: Updated CleanerUtils.getEarliestCommitToRetain() to support capping
  - Extended method signature to accept previousEarliestCommitToRetain and
    maxCommitsToClean parameters
  - Added capCommitsToClean() helper method to adjust earliest commit when
    cap is exceeded
  - Logs when capping is applied with before/after commit counts
- core: Updated CleanPlanner.getEarliestCommitToRetain() to retrieve
  previous clean metadata
  - Reads earliestCommitToRetain from last completed clean's metadata
  - Passes previous clean info and config to
    CleanerUtils.getEarliestCommitToRetain()
  - Gracefully handles missing previous clean metadata (no capping applied)
- core: Updated ArchivalUtils.getEarliestCommitToRetain() to pass empty
  values for new parameters
  - Archival continues to work without capping (uses Option.empty() and
    Long.MAX_VALUE)
- test: Added comprehensive unit tests in TestCleanerUtils
  - Tests for KEEP_LATEST_COMMITS policy with/without capping
  - Tests for KEEP_LATEST_BY_HOURS policy with/without capping
  - Tests for boundary conditions, missing previous clean, and default
    values
  - Added helper methods to create mock timelines with realistic timestamps
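To make the capping step easier to picture, here is a minimal, self-contained sketch of how an earliest-commit-to-retain could be pulled back when the cap is exceeded. It is not the code from this PR: the class name, the parameter list, and the use of a plain sorted list of commit times with java.util.Optional (rather than Hudi's timeline and Option types) are all assumptions made for illustration.

```java
import java.util.List;
import java.util.Optional;

// Illustrative sketch only; names and signature are hypothetical, not Hudi's.
final class CapCommitsSketch {

  /**
   * Returns the earliest commit to retain after applying the cap.
   *
   * @param commits completed commit times, sorted ascending (assumed input shape)
   * @param previousEarliestToRetain value recorded by the last completed clean, if any
   * @param proposedEarliestToRetain value the cleaning policy computed for this run
   * @param maxCommitsToClean cap on commits cleaned in this run (must be positive)
   */
  static String capCommitsToClean(List<String> commits,
                                  Optional<String> previousEarliestToRetain,
                                  String proposedEarliestToRetain,
                                  long maxCommitsToClean) {
    if (maxCommitsToClean <= 0) {
      throw new IllegalArgumentException("maxCommitsToClean must be positive");
    }
    // No previous clean metadata: no capping is applied.
    if (!previousEarliestToRetain.isPresent()) {
      return proposedEarliestToRetain;
    }
    int prevIdx = commits.indexOf(previousEarliestToRetain.get());
    int proposedIdx = commits.indexOf(proposedEarliestToRetain);
    // Unknown instants, or nothing new to clean: keep the proposed value.
    if (prevIdx < 0 || proposedIdx <= prevIdx) {
      return proposedEarliestToRetain;
    }
    long commitsToClean = (long) proposedIdx - prevIdx;
    if (commitsToClean <= maxCommitsToClean) {
      return proposedEarliestToRetain;
    }
    // Cap exceeded: advance only maxCommitsToClean commits past the previous
    // boundary. The cast is safe because the cap is below commitsToClean here.
    return commits.get(prevIdx + (int) maxCommitsToClean);
  }
}
```

With the default cap of Long.MAX_VALUE the comparison never trips, so the proposed value is returned unchanged, which mirrors the backward-compatible behavior described in this PR.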
### Impact
Configuration:
- New advanced configuration: hoodie.clean.max.commits.to.clean (default:
Long.MAX_VALUE)
- Applicable to KEEP_LATEST_COMMITS and KEEP_LATEST_BY_HOURS cleaning
policies
- Fully backward compatible - existing behavior unchanged unless
explicitly configured
Behavior Change:
- When hoodie.clean.max.commits.to.clean is set to a value <
  Long.MAX_VALUE:
  - The cleaner will limit the number of commits cleaned in each operation
  - Multiple clean runs may be needed to fully catch up when the timeline has
    many commits to clean
  - Each clean operation will adjust earliestCommitToRetain based on the
    cap
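As a rough way to estimate how many clean runs a backlog needs under a given cap, the arithmetic can be sketched as below. This is a back-of-the-envelope helper, not part of the PR; the class and method names are hypothetical, and it assumes no additional commits become eligible for cleaning between runs.

```java
// Illustrative sketch only; not part of Hudi.
final class CleanCatchUpSketch {

  /**
   * Number of clean runs needed to clear a backlog when each run cleans at
   * most maxCommitsToClean commits. Assumes the backlog does not grow
   * between runs.
   */
  static long runsToCatchUp(long backlog, long maxCommitsToClean) {
    if (backlog < 0 || maxCommitsToClean <= 0) {
      throw new IllegalArgumentException("backlog must be >= 0 and cap must be > 0");
    }
    if (backlog == 0) {
      return 0;
    }
    // Ceiling division written to avoid overflow when the cap is Long.MAX_VALUE.
    return (backlog - 1) / maxCommitsToClean + 1;
  }
}
```

For example, a backlog of 988 commits with a cap of 50 works out to 20 runs under these assumptions, while the default cap of Long.MAX_VALUE yields a single run.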
Performance Impact:
- Positive: Prevents memory exhaustion and timeouts during large clean
operations
- Positive: Allows operators to control resource usage during cleanup
- Neutral: Default behavior unchanged (no capping)
- Trade-off: May require multiple clean runs to fully catch up
(intentional for safety)
Example Usage:
# Cap cleaning to 50 commits per operation
hoodie.clean.max.commits.to.clean=50
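The same key can also be carried in a java.util.Properties object. The sketch below only shows reading the key with the documented default of Long.MAX_VALUE; the helper class is hypothetical and does not use Hudi's own HoodieWriteConfig or HoodieCleanConfig.Builder, whose exact signatures are not shown in this description.

```java
import java.util.Properties;

// Illustrative sketch only; a real job would go through Hudi's config classes.
final class MaxCommitsToCleanConfigSketch {
  static final String KEY = "hoodie.clean.max.commits.to.clean";

  /** Reads the cap, falling back to Long.MAX_VALUE (no capping) when unset. */
  static long readMaxCommitsToClean(Properties props) {
    String raw = props.getProperty(KEY);
    return raw == null ? Long.MAX_VALUE : Long.parseLong(raw.trim());
  }

  /** True when the cap is set to something below the default. */
  static boolean cappingEnabled(Properties props) {
    return readMaxCommitsToClean(props) < Long.MAX_VALUE;
  }
}
```

Treating an unset key as Long.MAX_VALUE keeps the sketch aligned with the backward-compatible default described under Configuration.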
Monitoring:
- Log messages indicate when capping is applied:
INFO CleanerUtils - Capping commits to clean from 988 to 50. Adjusted
earliest commit to retain from 20000000000988 to 20000000000050
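For operators who want to alert on capping, one option is to scan logs for this message. The sketch below assumes the message text matches the sample above and is emitted on a single line; the actual wording may differ across versions, and the parser class is hypothetical.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch only; the pattern is derived from the sample log line.
final class CappingLogParserSketch {
  private static final Pattern CAPPING =
      Pattern.compile("Capping commits to clean from (\\d+) to (\\d+)");

  /**
   * Returns {commitsBeforeCap, commitsAfterCap} when the line reports
   * capping, or an empty Optional for any other line.
   */
  static Optional<long[]> parseCommitCounts(String line) {
    Matcher m = CAPPING.matcher(line);
    if (!m.find()) {
      return Optional.empty();
    }
    return Optional.of(new long[] {
        Long.parseLong(m.group(1)), Long.parseLong(m.group(2))});
  }
}
```

Matching only the first sentence of the message keeps the parser tolerant of changes to the timestamp portion of the line.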
### Risk Level
low
### Documentation Update
<!-- Describe any necessary documentation update if there is any new
feature, config, or user-facing change. If not, put "none".
- The config description must be updated if new configs are added or the
default value of the configs are changed.
- Any new feature or user-facing change requires updating the Hudi website.
Please follow the
[instruction](https://hudi.apache.org/contribute/developer-setup#website)
to make changes to the website. -->
### Contributor's checklist
- [ ] Read through [contributor's
guide](https://hudi.apache.org/contribute/how-to-contribute)
- [ ] Enough context is provided in the sections above
- [ ] Adequate tests were added if applicable