zoomake opened a new pull request, #14087:
URL: https://github.com/apache/hudi/pull/14087

   ### What is the purpose of this pull request?
   This PR fixes a performance issue in Flink clustering where a partition left with a single small file (< `small.file.limit`) is rewritten again in every clustering job execution.
   
   ### What problem does this PR solve?
   In the current `FlinkClusteringPlanStrategy`, the clustering plan selects small files solely by comparing their size against the `small.file.limit` threshold, without considering whether a partition contains only one such small file.
   
   As a result, the same small file keeps being included in the clustering plan 
each time, 
   even though it has already been rewritten, causing redundant clustering 
operations 
   and unnecessary commits.
   
   ### What is the improvement?
   This change updates `FlinkClusteringPlanStrategy` to skip clustering for partitions where both of the following hold:
   - the partition contains only **one file**, and
   - that file is **smaller than `small.file.limit`**.
   
   This prevents repeated rewriting of the same small file in clustering jobs.
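   
   The skip condition can be illustrated with a minimal, self-contained sketch. It is not the code from this PR: `SingleSmallFileFilter`, `FileInfo`, `shouldSkipPartition`, and `selectCandidates` are hypothetical names, and file sizes are plain `long` values rather than Hudi file slices. It only shows the rule described above, assuming candidates are otherwise chosen by size alone.

```java
import java.util.Collections;
import java.util.List;
import java.util.stream.Collectors;

public class SingleSmallFileFilter {

    /** Hypothetical stand-in for a file in a partition; only the size matters here. */
    public static final class FileInfo {
        public final String name;
        public final long sizeBytes;

        public FileInfo(String name, long sizeBytes) {
            this.name = name;
            this.sizeBytes = sizeBytes;
        }
    }

    /** True when the partition holds exactly one file and that file is below the limit. */
    public static boolean shouldSkipPartition(List<FileInfo> partitionFiles, long smallFileLimit) {
        return partitionFiles.size() == 1
            && partitionFiles.get(0).sizeBytes < smallFileLimit;
    }

    /**
     * Returns the files of one partition that would be clustering candidates.
     * A partition with a single small file yields no candidates, so it is not
     * rewritten again on every run.
     */
    public static List<FileInfo> selectCandidates(List<FileInfo> partitionFiles, long smallFileLimit) {
        if (shouldSkipPartition(partitionFiles, smallFileLimit)) {
            return Collections.emptyList();
        }
        // Assumed baseline behavior: pick every file below the size limit.
        return partitionFiles.stream()
            .filter(f -> f.sizeBytes < smallFileLimit)
            .collect(Collectors.toList());
    }
}
```

   In this sketch a partition with two small files still yields both as candidates, since merging them can reduce the file count, whereas rewriting a lone small file cannot.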
   
   ### Does this change affect other components?
   No. The logic is applied only in the plan-building phase and does not modify 
commit or execution flows.
   
   ### Example log output
   
   
   JIRA Issue
   
   https://issues.apache.org/jira/browse/HUDI-7456


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.