kbuci opened a new issue, #17956:
URL: https://github.com/apache/hudi/issues/17956

   ### Feature Description
   
   **What the feature achieves:**
   Add a new clustering config that, when enabled, makes 
`org.apache.hudi.table.action.cluster.strategy.PartitionAwareClusteringPlanStrategy#buildClusteringGroupsForPartition`
 target files from earlier instant times in a partition first when not all 
files in the partition can be clustered. For example, when building clustering 
groups, we can sort input data files by `(instant time, -1 * file size)` 
instead of by `(-1 * file size)`.
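
A minimal sketch of the two sort orders described above, to make the proposed change concrete. This is not Hudi code: `InstantFirstOrdering`, `FileInfo`, and the `prioritizeEarlierInstants` flag are hypothetical names used only for illustration, and the sketch assumes instant times are timestamp strings that sort lexicographically in chronological order.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class InstantFirstOrdering {

    // Hypothetical stand-in for a data file: only the fields needed for ordering.
    static final class FileInfo {
        final String instantTime; // assumed to sort lexicographically by time
        final long sizeBytes;

        FileInfo(String instantTime, long sizeBytes) {
            this.instantTime = instantTime;
            this.sizeBytes = sizeBytes;
        }
    }

    // Ordering described as the current one: largest files first, i.e. (-1 * file size).
    static final Comparator<FileInfo> BY_SIZE_DESC =
        (a, b) -> Long.compare(b.sizeBytes, a.sizeBytes);

    // Proposed ordering: earliest instant first, then largest files first
    // within the same instant, i.e. (instant time, -1 * file size).
    static final Comparator<FileInfo> BY_INSTANT_THEN_SIZE_DESC =
        Comparator.comparing((FileInfo f) -> f.instantTime).thenComparing(BY_SIZE_DESC);

    // Returns a sorted copy; the flag plays the role of the proposed config (hypothetical).
    static List<FileInfo> order(List<FileInfo> files, boolean prioritizeEarlierInstants) {
        List<FileInfo> sorted = new ArrayList<>(files);
        sorted.sort(prioritizeEarlierInstants ? BY_INSTANT_THEN_SIZE_DESC : BY_SIZE_DESC);
        return sorted;
    }
}
```

With the flag off, a large file from a later instant is considered before smaller files from an earlier instant. With the flag on, every file from the earliest instant is considered before any file from a later one, so a job that runs out of budget mid-partition has still covered the oldest data.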
   
   **Why this feature is needed:**
   We have use cases with bulk-insert datasets where we run clustering on the 
latest few partitions to "stitch" together small files. But our jobs do not 
have sufficient resources to target all files in the partition in a short 
enough time. In this scenario, we want to prioritize earlier files in a 
partition to be stitched together first, since we expect queries to target data 
in those files first. We have implemented the approach suggested above 
internally to achieve this guarantee. Although this is less optimal in terms of 
"packing files", it is fine for our use case, as we typically have hundreds or 
thousands of files for each instant time.
   
   ### User Experience
   
   **How users will use this feature:**
   - Configuration changes needed
   - API changes
   - Usage examples
   
   
   ### Hudi RFC Requirements
   
   **RFC PR link:** (if applicable)
   
   **Why RFC is/isn't needed:**
   - Does this change public interfaces/APIs? (Yes/No)
   - Does this change storage format? (Yes/No)
   - Justification:
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
