kbuci opened a new pull request, #18016:
URL: https://github.com/apache/hudi/pull/18016

   ### Describe the issue this Pull Request addresses
   
   https://github.com/apache/hudi/issues/18014 
   
   ### Summary and Changelog
   
   Optimizes the incremental clean planner so that, for each commit instant, 
it targets only the partitions in which existing file groups were modified.
   Changes:
   Updated CleanPlanner#getPartitionsForInstants() to use 
getWritePartitionPathsWithExistingFileGroupsModified() instead of returning all 
partitions from getPartitionToWriteStats().keySet()
   Added getWritePartitionPathsWithExistingFileGroupsModified() override in 
HoodieReplaceCommitMetadata to include partitions with replaced file IDs
   Added unit tests for getWritePartitionPathsWithExistingFileGroupsModified() 
covering insert-only, update-only, and mixed scenarios
   Behavior change: When the clean planner incrementally processes instants 
since the last earliest-commit-to-retain (ECTR), it now selects only partitions 
where file groups were actually updated or replaced. Insert-only operations 
that only create new file groups in a partition no longer trigger unnecessary 
partition scans during cleaning.
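
The selection described above can be sketched roughly as follows. This is a minimal illustration and not the code in this PR: `CommitMeta`, its fields, and `partitionsForInstants` are hypothetical stand-ins, and the assumption is that a replace commit contributes every partition that has an entry in its replaced-file-ID map.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical stand-ins for illustration; these are not the Hudi classes.
class IncrementalCleanPartitionsSketch {

  // Simplified commit metadata: partitions whose existing file groups were
  // updated, plus (for replace commits) partition -> replaced file IDs.
  static final class CommitMeta {
    final Set<String> partitionsWithUpdates;
    final Map<String, List<String>> partitionToReplacedFileIds;

    CommitMeta(Set<String> partitionsWithUpdates,
               Map<String, List<String>> partitionToReplacedFileIds) {
      this.partitionsWithUpdates = partitionsWithUpdates;
      this.partitionToReplacedFileIds = partitionToReplacedFileIds;
    }

    // Assumption: updated partitions plus partitions with replaced file IDs.
    Set<String> partitionsWithExistingFileGroupsModified() {
      Set<String> result = new TreeSet<>(partitionsWithUpdates);
      result.addAll(partitionToReplacedFileIds.keySet());
      return result;
    }
  }

  // Union of the per-instant results; insert-only instants contribute nothing.
  static Set<String> partitionsForInstants(List<CommitMeta> instants) {
    Set<String> partitions = new TreeSet<>();
    for (CommitMeta meta : instants) {
      partitions.addAll(meta.partitionsWithExistingFileGroupsModified());
    }
    return partitions;
  }

  public static void main(String[] args) {
    CommitMeta insertOnly =
        new CommitMeta(Collections.emptySet(), Collections.emptyMap());
    CommitMeta update =
        new CommitMeta(Collections.singleton("2024/01/02"), Collections.emptyMap());
    CommitMeta replace = new CommitMeta(
        Collections.emptySet(),
        Collections.singletonMap("2024/01/03", Arrays.asList("fg-9")));

    System.out.println(partitionsForInstants(Arrays.asList(insertOnly, update, replace)));
    // prints [2024/01/02, 2024/01/03]
  }
}
```

In this sketch the insert-only instant adds no partitions, so a planner working from the result would not scan the partitions it touched.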
   
   ### Impact
   
   No public API changes. Internal performance optimization that reduces the 
number of partitions scanned during incremental cleaning. For workloads with 
many insert-only commits touching thousands of partitions, this significantly 
reduces clean planning overhead.
   
   ### Risk Level
   
   Medium - This optimization only skips partitions that contain no files 
eligible for cleaning. The 
getWritePartitionPathsWithExistingFileGroupsModified() method filters out stats 
where prevCommit is null or the string "null" (indicating an insert into a new 
file group), so only partitions in which an existing file group was modified 
are processed.
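
As a rough illustration of that filter (a sketch under stated assumptions, not the implementation in this PR): `WriteStat` below is a hypothetical, simplified stand-in that keeps only a file ID and `prevCommit`, and the method names are made up for the example. A partition is kept when at least one of its stats has a real `prevCommit`.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical stand-ins for illustration; these are not the Hudi classes.
class PrevCommitFilterSketch {

  // Simplified write stat: only the fields the filter needs.
  static final class WriteStat {
    final String fileId;
    final String prevCommit; // null or the string "null" marks a new file group

    WriteStat(String fileId, String prevCommit) {
      this.fileId = fileId;
      this.prevCommit = prevCommit;
    }
  }

  // True when the stat records a write to a file group that already existed.
  static boolean modifiesExistingFileGroup(WriteStat stat) {
    return stat.prevCommit != null && !"null".equals(stat.prevCommit);
  }

  // Keeps partitions where at least one stat touched an existing file group.
  static Set<String> partitionsWithExistingFileGroupsModified(
      Map<String, List<WriteStat>> partitionToWriteStats) {
    Set<String> result = new TreeSet<>();
    for (Map.Entry<String, List<WriteStat>> entry : partitionToWriteStats.entrySet()) {
      if (entry.getValue().stream()
          .anyMatch(PrevCommitFilterSketch::modifiesExistingFileGroup)) {
        result.add(entry.getKey());
      }
    }
    return result;
  }

  public static void main(String[] args) {
    Map<String, List<WriteStat>> stats = new TreeMap<>();
    // Insert only: new file group, no previous commit.
    stats.put("2024/01/01", Arrays.asList(new WriteStat("fg-1", null)));
    // Update only: existing file group with a previous commit.
    stats.put("2024/01/02", Arrays.asList(new WriteStat("fg-2", "001")));
    // Mixed: one insert (string "null") and one update.
    stats.put("2024/01/03",
        Arrays.asList(new WriteStat("fg-3", "null"), new WriteStat("fg-4", "001")));

    System.out.println(partitionsWithExistingFileGroupsModified(stats));
    // prints [2024/01/02, 2024/01/03]
  }
}
```

The three entries mirror the insert-only, update-only, and mixed scenarios the unit tests are said to cover.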
   
   ### Documentation Update
   
   None - This is an internal optimization with no new configs or user-facing 
changes.
   
   ### Contributor's checklist
   
   [x] Read through contributor's guide
   [x] Enough context is provided in the sections above
   [ ] Adequate tests were added if applicable
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

