zoomake opened a new pull request, #14087: URL: https://github.com/apache/hudi/pull/14087
### What is the purpose of this pull request?

This PR fixes a performance issue in Flink clustering: a partition left with a single small file (smaller than `small.file.limit`) is rewritten again in every clustering job execution.

### What problem does this PR solve?

In the current `FlinkClusteringPlanStrategy`, the clustering plan selects small files based solely on the size threshold (`small.file.limit`), without considering whether a partition contains only one such file. Rewriting a lone small file cannot merge it with anything, so the result is still below the threshold. As a result, the same file is included in every clustering plan even though it has already been rewritten, causing redundant clustering operations and unnecessary commits.

### What is the improvement?

This change updates `FlinkClusteringPlanStrategy` to skip clustering for a partition when both of the following hold:

- the partition contains only **one file**, and
- that file is **smaller than `small.file.limit`**.

This prevents clustering jobs from repeatedly rewriting the same small file.

### Does this change affect other components?

No. The logic is applied only in the plan-building phase and does not modify the commit or execution flows.

### Example log output

JIRA Issue: https://issues.apache.org/jira/browse/HUDI-7456
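
The skip rule described under "What is the improvement?" can be sketched roughly as follows. This is a minimal, standalone illustration and not the code from the PR: the class `SingleSmallFileFilter`, the method `shouldSkipPartition`, and the representation of a partition as a list of file sizes in bytes are all assumptions made for the example, and the threshold is passed in as a plain number rather than read from `small.file.limit`.

```java
import java.util.List;

// Hypothetical helper for illustration only; it does not mirror the real
// FlinkClusteringPlanStrategy API or the actual change in the PR.
class SingleSmallFileFilter {

  /**
   * Returns true when a partition should be left out of the clustering plan:
   * it holds exactly one file, and that file is strictly smaller than the
   * small-file limit. Rewriting such a file alone cannot make it larger.
   */
  static boolean shouldSkipPartition(List<Long> fileSizesInBytes, long smallFileLimitBytes) {
    return fileSizesInBytes.size() == 1
        && fileSizesInBytes.get(0) < smallFileLimitBytes;
  }

  public static void main(String[] args) {
    long limit = 100L * 1024 * 1024; // assumed 100 MB threshold for the demo

    // One small file: nothing to merge it with, so the partition is skipped.
    System.out.println(shouldSkipPartition(List.of(50L * 1024 * 1024), limit));

    // Two small files: clustering can merge them, so the partition is kept.
    System.out.println(shouldSkipPartition(
        List.of(50L * 1024 * 1024, 60L * 1024 * 1024), limit));

    // One file at or above the limit: not covered by the skip rule.
    System.out.println(shouldSkipPartition(List.of(200L * 1024 * 1024), limit));
  }
}
```

Keeping the check as a pure function of file sizes and the threshold makes it easy to apply while the plan is being built, which is consistent with the statement that commit and execution flows are not touched.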
