prashantwason opened a new pull request, #17550:
URL: https://github.com/apache/hudi/pull/17550

   ### Describe the issue this Pull Request addresses
   
   This PR adds a new configuration option to help manage memory usage during 
clean operations on very large tables with many partitions. When incremental 
cleaning is disabled, a full clean processes all partitions at once, which can 
lead to OOM (Out of Memory) failures. Users can now supply a regex pattern to 
clean only the matching partitions.
   
   ### Summary and Changelog
   
   **Summary:**
   Users can now use `hoodie.cleaner.partition.filter.regex` config to restrict 
which partitions are cleaned during full clean operations, enabling better 
control over memory usage for large tables.
   
   **Changelog:**
   - Added new config `hoodie.cleaner.partition.filter.regex` in 
`HoodieCleanConfig` to specify a regex pattern for filtering partitions during 
cleaning
   - Exposed the config through `HoodieWriteConfig.getCleanerPartitionRegex()` 
method
   - Modified `CleanPlanner.getPartitionPathsForFullCleaning()` to apply regex 
filtering on partition paths when the config is set
   - Added validation that disallows this config when incremental cleaning 
mode is enabled, since the two are mutually exclusive
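
The filtering step described in the changelog can be sketched roughly as follows. This is a minimal illustration, not the code in `CleanPlanner`: the class name `PartitionRegexFilterSketch` and the helper `filterPartitions` are hypothetical, and the use of a full match (`Matcher.matches()`) rather than a partial match is an assumption about the semantics.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class PartitionRegexFilterSketch {

    // Hypothetical helper: keeps only the partition paths that match the regex.
    // An empty or null regex means "no filtering", mirroring the default
    // (empty string) described in this PR.
    public static List<String> filterPartitions(List<String> partitionPaths, String regex) {
        if (regex == null || regex.isEmpty()) {
            return partitionPaths;
        }
        Pattern pattern = Pattern.compile(regex);
        return partitionPaths.stream()
            // Assumption: the whole partition path must match the pattern.
            .filter(path -> pattern.matcher(path).matches())
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> all = Arrays.asList("2023/12/31", "2024/01/01", "2024/01/02");
        System.out.println(filterPartitions(all, "2024.*")); // only the 2024 paths
        System.out.println(filterPartitions(all, ""));       // unchanged list
    }
}
```

Compiling the pattern once and reusing it across all paths keeps the per-partition cost low, which matters when the partition list is large.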
   
   ### Impact
   
   **Public API Changes:**
   - New config property: `hoodie.cleaner.partition.filter.regex` (default: 
empty string)
   - New public method: `HoodieWriteConfig.getCleanerPartitionRegex()`
   
   **User-Facing Changes:**
   Users can now specify a regex pattern to filter partitions during full clean 
operations. For example:
   - To clean only 2024 partitions: 
`hoodie.cleaner.partition.filter.regex=2024.*`
   - To clean specific partition patterns: 
`hoodie.cleaner.partition.filter.regex=partition_(a|b|c).*`
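
As a rough illustration of how these settings might be assembled in code, the sketch below builds a plain `java.util.Properties` object. The class and method names are hypothetical, and `hoodie.cleaner.incremental.mode` is assumed here to be the key that controls incremental cleaning; it is set to `false` because the PR states the regex cannot be combined with incremental mode.

```java
import java.util.Properties;

public class CleanerRegexConfigSketch {

    // Hypothetical helper: returns properties for a full clean restricted
    // to partitions matching the given regex.
    public static Properties buildCleanerProps(String partitionRegex) {
        Properties props = new Properties();
        // Assumption: this key disables incremental cleaning, which the
        // regex filter is not allowed to be combined with.
        props.setProperty("hoodie.cleaner.incremental.mode", "false");
        // Config key introduced by this PR.
        props.setProperty("hoodie.cleaner.partition.filter.regex", partitionRegex);
        return props;
    }

    public static void main(String[] args) {
        Properties props = buildCleanerProps("partition_(a|b|c).*");
        props.forEach((key, value) -> System.out.println(key + "=" + value));
    }
}
```

How the properties are passed on to a writer or cleaner job depends on the deployment and is left out of this sketch.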
   
   **Performance Impact:**
   Positive - Reduces memory footprint during clean operations by processing 
fewer partitions at a time, helping avoid OOM failures on large tables.
   
   ### Risk Level
   
   **Low**
   
   The change is backward compatible with a safe default (empty string means no 
filtering, preserving existing behavior). The feature only activates when 
explicitly configured by users. Additionally:
   - Validation prevents conflicting configuration (incremental mode + regex 
filtering)
   - The regex filtering is applied after fetching partition paths, so it 
doesn't affect partition discovery logic
   - Only affects the clean operation, no impact on read/write paths
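
The mutual-exclusion check mentioned above might look something like the following. This is only a sketch under assumptions: the class `CleanerConfigValidationSketch`, the method `validate`, and the choice of `IllegalArgumentException` are all hypothetical and may differ from what the PR actually does.

```java
public class CleanerConfigValidationSketch {

    // Hypothetical check: a non-empty partition regex must not be combined
    // with incremental cleaning mode.
    public static void validate(boolean incrementalCleaningEnabled, String partitionRegex) {
        boolean regexSet = partitionRegex != null && !partitionRegex.isEmpty();
        if (incrementalCleaningEnabled && regexSet) {
            // Assumption: the exception type and message are illustrative only.
            throw new IllegalArgumentException(
                "hoodie.cleaner.partition.filter.regex cannot be used when "
                    + "incremental cleaning mode is enabled");
        }
    }

    public static void main(String[] args) {
        validate(false, "2024.*"); // accepted: full clean with a regex filter
        validate(true, "");        // accepted: incremental clean, no filter
        try {
            validate(true, "2024.*"); // rejected: conflicting settings
        } catch (IllegalArgumentException e) {
            System.out.println("Rejected: " + e.getMessage());
        }
    }
}
```

Failing fast at configuration time surfaces the conflict before any clean plan is built.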
   
   ### Documentation Update
   
   **Config Documentation:**
   - Updated `HoodieCleanConfig` with documentation for the new 
`hoodie.cleaner.partition.filter.regex` config property
   - Documentation explains the use case (avoiding OOM on large tables) and 
constraint (cannot be used with incremental cleaning mode)
   
   **Website Update:**
   The configuration documentation on the Hudi website should be updated to 
include this new config parameter in the table services / cleaning section.
   
   ### Contributor's checklist
   
   - [x] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [x] Enough context is provided in the sections above
   - [x] Adequate tests were added if applicable

