kbuci opened a new issue, #18014: URL: https://github.com/apache/hudi/issues/18014
### Task Description

**What needs to be done:** Currently, when the incremental clean planner scans all instants since the latest earliest commit to retain (ECTR), `org.apache.hudi.table.action.clean.CleanPlanner#getPartitionsForInstants` adds every partition that appears in any instant's `partitionToWriteStats`. We should optimize this flow so that, when processing a `commit` instant, a partition is added only if the `commit` metadata has at least one entry in `partitionToWriteStats` for that partition where a file group was updated. Any partition in which an instant updated or replaced a file group must still be added to the list of partitions to scan, so that we don't "miss" any files to clean.

**Why this task is needed:** For insert-only workloads (where small file handling is disabled), if thousands of partitions were touched by `commit`s since the latest ECTR, the clean planner has to scan all of them unnecessarily, even though there is nothing to clean in these partitions.

### Task Type

Performance optimization

### Related Issues

**Parent feature issue:** (if applicable)

**Related issues:**
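
A minimal sketch of the proposed filtering, for illustration only. It uses a simplified, hypothetical model rather than the real Hudi classes: `WriteStat`, its `prevCommit` field (assumed to be `null` when a write created a new file group), and `partitionsToScan` are stand-ins invented here, not the actual `CleanPlanner` code. For a `commit` instant it keeps only partitions with at least one write that updated an existing file group; for any other action (for example a replace commit) it conservatively keeps every partition.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical, simplified model of the proposed filter; not the Hudi API. */
final class IncrementalCleanPartitionFilter {

    /** Stand-in for a per-file write stat inside one instant's metadata. */
    static final class WriteStat {
        final String fileId;
        final String prevCommit; // assumption: null means the file group was newly created

        WriteStat(String fileId, String prevCommit) {
            this.fileId = fileId;
            this.prevCommit = prevCommit;
        }

        /** True when this write produced a new version of an existing file group. */
        boolean updatesExistingFileGroup() {
            return prevCommit != null;
        }
    }

    /**
     * Returns the partitions of a single instant that the clean planner would
     * still need to scan under the proposed optimization.
     */
    static Set<String> partitionsToScan(String action,
                                        Map<String, List<WriteStat>> partitionToWriteStats) {
        Set<String> partitions = new TreeSet<>();
        if (!"commit".equals(action)) {
            // Non-commit instants may replace file groups: keep every partition.
            partitions.addAll(partitionToWriteStats.keySet());
            return partitions;
        }
        for (Map.Entry<String, List<WriteStat>> entry : partitionToWriteStats.entrySet()) {
            // Insert-only partitions have no older file versions to clean, so skip them.
            boolean anyUpdate = entry.getValue().stream()
                    .anyMatch(WriteStat::updatesExistingFileGroup);
            if (anyUpdate) {
                partitions.add(entry.getKey());
            }
        }
        return partitions;
    }
}
```

With this shape, an insert-only `commit` that touches thousands of partitions contributes no partitions to the scan list, while a single updated file group is enough to keep its partition in the list.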
