kbuci opened a new issue, #17956: URL: https://github.com/apache/hudi/issues/17956
### Feature Description

**What the feature achieves:**

Add a new clustering config that, when enabled, makes `org.apache.hudi.table.action.cluster.strategy.PartitionAwareClusteringPlanStrategy#buildClusteringGroupsForPartition` target files with earlier instant times in a partition first when not all files in the partition can be clustered. For example, when building clustering groups, we could sort input data files by `(instant time, -1 * file size)` instead of by `(-1 * file size)`.

**Why this feature is needed:**

We have use cases with bulk-insert datasets where we run clustering on the latest few partitions to "stitch" together small files. However, our jobs do not have sufficient resources to cluster all files in a partition in a short enough time. In this scenario, we want to prioritize stitching together the earlier files in a partition, since we expect queries to target data in those files first. We have implemented the change suggested above internally to achieve this guarantee. Although this is less optimal in terms of "packing files", it is fine for our use case, as we typically have hundreds or thousands of files for each instant time.

### User Experience

**How users will use this feature:**
- Configuration changes needed
- API changes
- Usage examples

### Hudi RFC Requirements

**RFC PR link:** (if applicable)

**Why RFC is/isn't needed:**
- Does this change public interfaces/APIs? (Yes/No)
- Does this change storage format? (Yes/No)
- Justification:
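
To make the proposed ordering concrete, here is a minimal sketch of the two sort orders described under "What the feature achieves". It is illustrative only and not the Hudi implementation: `DataFile`, `ClusteringOrderSketch`, and the `prioritizeEarlierInstants` flag are hypothetical names, and the sketch assumes instant times are equal-length timestamp strings that sort chronologically when compared lexicographically.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class ClusteringOrderSketch {

    // Hypothetical stand-in for a data file in a partition (not Hudi's FileSlice).
    static final class DataFile {
        final String instantTime; // assumed: fixed-width timestamp string
        final long sizeBytes;

        DataFile(String instantTime, long sizeBytes) {
            this.instantTime = instantTime;
            this.sizeBytes = sizeBytes;
        }
    }

    // Ordering described as the current behavior: largest files first.
    static final Comparator<DataFile> BY_SIZE_DESC =
        Comparator.comparingLong((DataFile f) -> f.sizeBytes).reversed();

    // Proposed ordering: earliest instant first, then largest files first.
    static final Comparator<DataFile> BY_INSTANT_THEN_SIZE_DESC =
        Comparator.comparing((DataFile f) -> f.instantTime)
                  .thenComparing(BY_SIZE_DESC);

    // Returns a sorted copy; the flag models the proposed (hypothetical) config.
    static List<DataFile> order(List<DataFile> files, boolean prioritizeEarlierInstants) {
        List<DataFile> sorted = new ArrayList<>(files);
        sorted.sort(prioritizeEarlierInstants ? BY_INSTANT_THEN_SIZE_DESC : BY_SIZE_DESC);
        return sorted;
    }
}
```

With the flag off, a large file from a later instant is considered before smaller files from earlier instants. With the flag on, every file from the earliest instant is considered first, so a resource-limited clustering run covers the oldest data before moving on, at the cost of less efficient packing.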
