[I] [to be discussed] Spark clustering planner should support skipping clustering groups where number of input and output file slices are same [hudi]

via GitHub Fri, 16 Jan 2026 10:39:50 -0800


kbuci opened a new issue, #17918:
URL: https://github.com/apache/hudi/issues/17918


   ### Feature Description
   
   **What the feature achieves:**
   Add a new clustering config that, when enabled, makes 
`org.apache.hudi.table.action.cluster.strategy.PartitionAwareClusteringPlanStrategy#buildClusteringGroupsForPartition`
 skip any clustering group whose number of input file slices equals its number 
of output file slices.
   
   **Why this feature is needed:**
   We use a clustering execution class similar to 
`org.apache.hudi.client.clustering.run.strategy.SparkBinaryCopyClusteringExecutionStrategy`
 to merge smaller files into one large file, and we set 
`hoodie.clustering.plan.strategy.target.file.max.bytes` to the target output 
file size. In some scenarios, however, the planner creates clustering groups 
with just one input and one output file slice, so executing them merges 
nothing. For example, if a partition has these files
   ```
   file1: 800 MB
   file2: 800 MB
   file3: 200 MB
   ```
   and we cluster with a target file size/clustering group size of 1 GB, then 
the planner emits the clustering groups
   `([file1], 1), ([file2, file3], 2)`. We do not want to execute the first 
group.
   
   The proposed config would achieve this.
   We can upstream our internal implementation once we reach consensus.
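
To make the proposal concrete, here is a minimal standalone sketch of the filtering idea. It is not the Hudi implementation and does not use Hudi classes: `Group`, `buildGroups`, `dropNoOpGroups` and the `skipEnabled` flag are hypothetical names, files are represented only by their sizes, and the output file count is assumed to be the total group size divided by the target file size, rounded up. With 1 GB taken as 1024 MB, that assumption yields one planned output file for each group in the example above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;

public class ClusteringGroupFilterSketch {

  /** Hypothetical stand-in for a planned clustering group (not a Hudi class). */
  static final class Group {
    final List<Long> inputFileSizes; // one entry per input file slice, in bytes
    final int numOutputFiles;        // planned number of output files

    Group(List<Long> inputFileSizes, int numOutputFiles) {
      this.inputFileSizes = inputFileSizes;
      this.numOutputFiles = numOutputFiles;
    }
  }

  /** Assumed sizing rule: ceil(totalBytes / targetFileBytes), at least 1. */
  static int numOutputFiles(long totalBytes, long targetFileBytes) {
    return (int) Math.max(1L, (totalBytes + targetFileBytes - 1) / targetFileBytes);
  }

  /** Greedy grouping: close the current group when the next file would exceed maxGroupBytes. */
  static List<Group> buildGroups(List<Long> fileSizes, long maxGroupBytes, long targetFileBytes) {
    List<Group> groups = new ArrayList<>();
    List<Long> current = new ArrayList<>();
    long currentBytes = 0L;
    for (long size : fileSizes) {
      if (!current.isEmpty() && currentBytes + size > maxGroupBytes) {
        groups.add(new Group(new ArrayList<>(current), numOutputFiles(currentBytes, targetFileBytes)));
        current.clear();
        currentBytes = 0L;
      }
      current.add(size);
      currentBytes += size;
    }
    if (!current.isEmpty()) {
      groups.add(new Group(new ArrayList<>(current), numOutputFiles(currentBytes, targetFileBytes)));
    }
    return groups;
  }

  /** The proposed behavior: when enabled, drop groups whose input count equals their output count. */
  static List<Group> dropNoOpGroups(List<Group> groups, boolean skipEnabled) {
    if (!skipEnabled) {
      return groups;
    }
    return groups.stream()
        .filter(g -> g.inputFileSizes.size() != g.numOutputFiles)
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    long mb = 1024L * 1024L;
    List<Group> planned = buildGroups(List.of(800 * mb, 800 * mb, 200 * mb), 1024 * mb, 1024 * mb);
    List<Group> kept = dropNoOpGroups(planned, true);
    System.out.println("planned=" + planned.size() + ", kept=" + kept.size());
  }
}
```

In this sketch the filter runs after grouping, so the grouping logic itself is unchanged and the existing behavior is preserved when the flag is off.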
   
   ### User Experience
   
   **How users will use this feature:**
   - Configuration changes needed
   - API changes
   - Usage examples
   
   
   ### Hudi RFC Requirements
   
   **RFC PR link:** (if applicable)
   
   **Why RFC is/isn't needed:**
   - Does this change public interfaces/APIs? (Yes/No)
   - Does this change storage format? (Yes/No)
   - Justification:
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
