prashantwason opened a new pull request, #17550: URL: https://github.com/apache/hudi/pull/17550
### Describe the issue this Pull Request addresses

This PR adds a new configuration option to help manage memory usage during clean operations on very large tables with many partitions. When incremental cleaning is disabled, users can now use a regex pattern to selectively clean specific partitions rather than processing all partitions at once, which can lead to OOM (Out of Memory) failures.

### Summary and Changelog

**Summary:** Users can now use `hoodie.cleaner.partition.filter.regex` config to restrict which partitions are cleaned during full clean operations, enabling better control over memory usage for large tables.

**Changelog:**
- Added new config `hoodie.cleaner.partition.filter.regex` in `HoodieCleanConfig` to specify regex pattern for filtering partitions during clean
- Exposed the config through `HoodieWriteConfig.getCleanerPartitionRegex()` method
- Modified `CleanPlanner.getPartitionPathsForFullCleaning()` to apply regex filtering on partition paths when the config is set
- Added validation to prevent using this config when incremental cleaning mode is enabled (as they are mutually exclusive)

### Impact

**Public API Changes:**
- New config property: `hoodie.cleaner.partition.filter.regex` (default: empty string)
- New public method: `HoodieWriteConfig.getCleanerPartitionRegex()`

**User-Facing Changes:** Users can now specify a regex pattern to filter partitions during full clean operations. For example:
- To clean only 2024 partitions: `hoodie.cleaner.partition.filter.regex=2024.*`
- To clean specific partition patterns: `hoodie.cleaner.partition.filter.regex=partition_(a|b|c).*`

**Performance Impact:** Positive - Reduces memory footprint during clean operations by processing fewer partitions at a time, helping avoid OOM failures on large tables.

### Risk Level

**Low**

The change is backward compatible with a safe default (empty string means no filtering, preserving existing behavior). The feature only activates when explicitly configured by users.
Additionally:
- Validation prevents conflicting configuration (incremental mode + regex filtering)
- The regex filtering is applied after fetching partition paths, so it doesn't affect partition discovery logic
- Only affects the clean operation, no impact on read/write paths

### Documentation Update

**Config Documentation:**
- Updated `HoodieCleanConfig` with documentation for the new `hoodie.cleaner.partition.filter.regex` config property
- Documentation explains the use case (avoiding OOM on large tables) and constraint (cannot be used with incremental cleaning mode)

**Website Update:** The configuration documentation on the Hudi website should be updated to include this new config parameter in the table services / cleaning section.

### Contributor's checklist

- [x] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
- [x] Enough context is provided in the sections above
- [x] Adequate tests were added if applicable
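
To illustrate the behavior described in the changelog, here is a minimal, self-contained sketch of how a regex filter over partition paths and the accompanying validation might look. This is not the code from the PR: the class `PartitionRegexFilterSketch` and the methods `filterPartitions` and `validate` are hypothetical names chosen for illustration. It also assumes that the pattern must match the entire partition path (`Matcher.matches()`); the description above does not say whether the actual implementation uses full or partial matching.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical sketch; not the actual Hudi CleanPlanner implementation.
class PartitionRegexFilterSketch {

    // Returns only the partition paths matching the regex.
    // An empty (or null) regex means "no filtering", mirroring the
    // empty-string default described for the new config.
    // Assumption: the regex must match the whole partition path.
    public static List<String> filterPartitions(List<String> partitionPaths, String regex) {
        if (regex == null || regex.isEmpty()) {
            return partitionPaths;
        }
        Pattern pattern = Pattern.compile(regex);
        return partitionPaths.stream()
            .filter(path -> pattern.matcher(path).matches())
            .collect(Collectors.toList());
    }

    // Rejects the conflicting combination described above:
    // incremental cleaning enabled together with a non-empty regex.
    public static void validate(boolean incrementalCleaningEnabled, String regex) {
        if (incrementalCleaningEnabled && regex != null && !regex.isEmpty()) {
            throw new IllegalArgumentException(
                "hoodie.cleaner.partition.filter.regex cannot be used with incremental cleaning");
        }
    }

    public static void main(String[] args) {
        List<String> partitions = Arrays.asList("2023/12/31", "2024/01/01", "2024/01/02");

        validate(false, "2024.*");
        // Prints [2024/01/01, 2024/01/02]
        System.out.println(filterPartitions(partitions, "2024.*"));
        // Empty regex keeps every partition: prints [2023/12/31, 2024/01/01, 2024/01/02]
        System.out.println(filterPartitions(partitions, ""));
    }
}
```

Under this assumption, a large table could be cleaned in several passes by running the cleaner with a different regex each time (for example one pass per year prefix), so that each pass plans over a smaller set of partitions.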
