kbuci opened a new issue, #18014:
URL: https://github.com/apache/hudi/issues/18014

   ### Task Description
   
   **What needs to be done:**
   Currently, when the incremental clean planner scans all instants since the 
most recent earliest commit to retain (ECTR), 
`org.apache.hudi.table.action.clean.CleanPlanner#getPartitionsForInstants` 
adds every partition that appears in any instant's `partitionToWriteStats`. We 
should optimize this flow so that, when processing a `commit` instant, a 
partition is added only if the `commit` metadata has at least one entry for it 
in `partitionToWriteStats` where a file group was updated. Any partition in 
which an instant updated or replaced a file group must still be added to the 
list of partitions to scan, so that we don't miss any files to clean.
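
   The proposed filtering could be sketched roughly as follows. This is only an 
illustration under stated assumptions, not the actual `CleanPlanner` code: 
`WriteStat`, `Instant`, and `partitionsToScan` are hypothetical simplified 
stand-ins for Hudi's metadata classes, a `null` `prevCommit` is assumed to mark 
a newly created file group, and every non-`commit` instant (for example a 
`replacecommit`) is conservatively assumed to require scanning.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical, simplified model of commit metadata; not Hudi's real API. */
final class CleanPartitionFilter {

    /** One written file: its file group id and the commit it was based on. */
    static final class WriteStat {
        final String fileId;
        final String prevCommit; // assumption: null means a brand-new file group (insert)

        WriteStat(String fileId, String prevCommit) {
            this.fileId = fileId;
            this.prevCommit = prevCommit;
        }

        /** True when this write produced a new version of an existing file group. */
        boolean isUpdate() {
            return prevCommit != null;
        }
    }

    /** One completed instant: its action type and its partition -> write stats map. */
    static final class Instant {
        final String action; // e.g. "commit" or "replacecommit"
        final Map<String, List<WriteStat>> partitionToWriteStats;

        Instant(String action, Map<String, List<WriteStat>> partitionToWriteStats) {
            this.action = action;
            this.partitionToWriteStats = partitionToWriteStats;
        }
    }

    /**
     * Collects the partitions the clean planner would need to scan.
     * For "commit" instants a partition is kept only if at least one of its
     * write stats updated an existing file group; partitions touched by any
     * other action are always kept so that no cleanable file is missed.
     */
    static Set<String> partitionsToScan(List<Instant> instants) {
        Set<String> partitions = new TreeSet<>();
        for (Instant instant : instants) {
            boolean isPlainCommit = "commit".equals(instant.action);
            for (Map.Entry<String, List<WriteStat>> entry
                    : instant.partitionToWriteStats.entrySet()) {
                boolean updatedFileGroup =
                        entry.getValue().stream().anyMatch(WriteStat::isUpdate);
                if (!isPlainCommit || updatedFileGroup) {
                    partitions.add(entry.getKey());
                }
            }
        }
        return partitions;
    }
}
```

   With this filter, a `commit` that only created new file groups in a partition 
contributes nothing to the scan list, while a single updated file group in that 
partition is enough to include it.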
   
   **Why this task is needed:**
   For insert-only workloads (where small file handling is disabled), if 
thousands of partitions were touched by `commit`s since the latest ECTR, the 
clean planner has to scan all of these partitions unnecessarily, even though 
there is nothing to clean in them.
   
   ### Task Type
   
   Performance optimization
   
   ### Related Issues
   
   **Parent feature issue:** (if applicable )
   **Related issues:**
   NOTE: Use `Relationships` button to add parent/blocking issues after issue 
is created.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
