kbuci opened a new pull request, #18172:
URL: https://github.com/apache/hudi/pull/18172

   ### Describe the issue this Pull Request addresses
   
   When the single clustering group config is **disabled** 
(`hoodie.clustering.plan.strategy.single.group.clustering.enabled=false`), the 
clustering plan strategy could still create clustering groups with exactly one 
input file and one output file. Clustering one file into one file 
has no benefit and wastes resources. This fix ensures that when single-group 
clustering is disabled, such no-op groups are not created.
   
   ### Summary and Changelog
   
   **Summary:** When single-group clustering is disabled, clustering no longer 
schedules groups that would cluster one file into one file. All other 
clustering behavior is unchanged.
   
   **Changelog:**
   - **PartitionAwareClusteringPlanStrategy**: If 
`isSingleGroupClusteringEnabled` is false, clustering groups are skipped when 
the number of input file slices and the number of output file slices are both 1.
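
The skip condition described above can be sketched roughly as follows. This is an illustration only, not the code in this PR: the class `SingleGroupClusteringCheck`, the method `shouldSkipGroup`, and its parameters are hypothetical names, and the real logic in `PartitionAwareClusteringPlanStrategy` may be structured differently.

```java
// Hypothetical sketch; names are illustrative and are not Hudi's actual API.
final class SingleGroupClusteringCheck {

  private SingleGroupClusteringCheck() {
  }

  /**
   * Returns true when a candidate clustering group should be left out of the plan.
   *
   * @param singleGroupClusteringEnabled assumed to mirror the value of
   *        hoodie.clustering.plan.strategy.single.group.clustering.enabled
   * @param numInputFiles  number of input files in the candidate group
   * @param numOutputFiles number of output files the group would produce
   */
  static boolean shouldSkipGroup(boolean singleGroupClusteringEnabled,
                                 int numInputFiles,
                                 int numOutputFiles) {
    // Rewriting one file into one file does no useful work.
    boolean oneToOne = numInputFiles == 1 && numOutputFiles == 1;
    // Only skip when the user has not opted in to single-group clustering.
    return !singleGroupClusteringEnabled && oneToOne;
  }
}
```

Under this sketch, `shouldSkipGroup(false, 1, 1)` is the only combination that skips a group; a 1-to-1 group with the config enabled, or any group with more than one input or output file, is still scheduled.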
   
   ### Impact
   
   - **Public API:** None.
   - **Performance:** Reduces unnecessary clustering work and scheduling for 
single-file partitions when the config is disabled.
   
   ### Risk Level
   
   **Low.** The change only affects the case where single-group clustering is 
disabled and a group would have 1 input and 1 output; all other behavior is 
unchanged. Logic is covered by the new unit test 
`testRemaningFileInPartitionNotClustered()`.
   
   ### Documentation Update
   
   None. This is a behavioral fix for an existing config; no new config or 
default change.
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Enough context is provided in the sections above
   - [ ] Adequate tests were added if applicable
   

