kbuci opened a new pull request, #18016: URL: https://github.com/apache/hudi/pull/18016
### Describe the issue this Pull Request addresses

https://github.com/apache/hudi/issues/18014

### Summary and Changelog

Optimizes the incremental clean planner to target only partitions in commit instants that have modified existing file groups.

Changes:

- Updated `CleanPlanner#getPartitionsForInstants()` to use `getWritePartitionPathsWithExistingFileGroupsModified()` instead of returning all partitions from `getPartitionToWriteStats().keySet()`.
- Added a `getWritePartitionPathsWithExistingFileGroupsModified()` override in `HoodieReplaceCommitMetadata` to include partitions with replaced file IDs.
- Added unit tests for `getWritePartitionPathsWithExistingFileGroupsModified()` covering insert-only, update-only, and mixed scenarios.

Behavior change: when the clean planner incrementally processes instants since the last earliest-commit-to-retain (ECTR), it now selects only partitions where file groups were actually updated or replaced. Insert-only operations that create new file groups in a partition no longer trigger unnecessary partition scans during cleaning.

### Impact

No public API changes. This is an internal performance optimization that reduces the number of partitions scanned during incremental cleaning. For workloads with many insert-only commits touching thousands of partitions, it significantly reduces clean planning overhead.

### Risk Level

Medium. This optimization only skips partitions that contain no files eligible for cleaning. The `getWritePartitionPathsWithExistingFileGroupsModified()` method filters out stats where `prevCommit` is null or `"null"` (indicating a new file group insert), ensuring that only partitions with actual file modifications are processed.

### Documentation Update

None. This is an internal optimization with no new configs or user-facing changes.

### Contributor's checklist

- [x] Read through contributor's guide
- [x] Enough context is provided in the sections above
- [ ] Adequate tests were added if applicable
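
For reviewers, the filtering rule described in this PR can be sketched roughly as follows. This is a minimal, self-contained illustration and not the code in the PR: `IncrementalCleanSketch`, `WriteStat`, and both method names are hypothetical stand-ins, and the real Hudi metadata classes carry many more fields. It assumes a write stat counts as a new file group when its `prevCommit` is null or the string `"null"`, as stated above, and that a replace commit contributes a partition only when its list of replaced file IDs is non-empty (an assumption made for this sketch).

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class IncrementalCleanSketch {

  /** Hypothetical stand-in for a per-file write stat; not Hudi's actual class. */
  static final class WriteStat {
    final String fileId;
    final String prevCommit; // null or "null" is taken to mean a newly created file group

    WriteStat(String fileId, String prevCommit) {
      this.fileId = fileId;
      this.prevCommit = prevCommit;
    }
  }

  /** True when the stat points at an earlier commit, i.e. an existing file group was rewritten. */
  static boolean modifiesExistingFileGroup(WriteStat stat) {
    return stat.prevCommit != null && !"null".equals(stat.prevCommit);
  }

  /** Keeps only partitions holding at least one stat that modified an existing file group. */
  static Set<String> partitionsWithExistingFileGroupsModified(
      Map<String, List<WriteStat>> partitionToWriteStats) {
    Set<String> result = new TreeSet<>();
    for (Map.Entry<String, List<WriteStat>> entry : partitionToWriteStats.entrySet()) {
      boolean touched = entry.getValue().stream()
          .anyMatch(IncrementalCleanSketch::modifiesExistingFileGroup);
      if (touched) {
        result.add(entry.getKey());
      }
    }
    return result;
  }

  /** Replace-commit variant: also includes partitions that list replaced file IDs. */
  static Set<String> partitionsForReplaceCommit(
      Map<String, List<WriteStat>> partitionToWriteStats,
      Map<String, List<String>> partitionToReplacedFileIds) {
    Set<String> result =
        new TreeSet<>(partitionsWithExistingFileGroupsModified(partitionToWriteStats));
    for (Map.Entry<String, List<String>> entry : partitionToReplacedFileIds.entrySet()) {
      // Assumption for this sketch: an empty list means nothing was replaced there.
      if (!entry.getValue().isEmpty()) {
        result.add(entry.getKey());
      }
    }
    return result;
  }
}
```

Under these assumptions, an insert-only partition drops out of the result, while a partition with even one updated file group, or one replaced file ID, stays in.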
