kbuci opened a new pull request, #18174: URL: https://github.com/apache/hudi/pull/18174
### Describe the issue this Pull Request addresses

Adds a config that lets the clustering plan strategy sort file slices by commit time (earlier first) before file size when building clustering groups. This helps use cases (e.g. stitching) that want to cluster older data first to reduce lag. The behavior is opt-in; the default remains size-only sorting to preserve existing behavior.

### Summary and Changelog

**Summary:** New config `hoodie.clustering.earlier_instants_first` (default `false`). When enabled, `PartitionAwareClusteringPlanStrategy` sorts file slices by base file commit time ascending, then by file size descending, so older data is clustered first.

**Changelog:**

- **HoodieClusteringConfig:** Added `EARLIER_INSTANTS_FIRST` config property (default `false`) and `Builder.withEarlierInstantsFirst(Boolean)`.
- **HoodieWriteConfig:** Added `isEarlierInstantsFirst()`.
- **PartitionAwareClusteringPlanStrategy:** Replaced the size-only sort with a configurable comparator: when `isEarlierInstantsFirst()` is true, sort by commit time, then by file size (descending); otherwise keep the previous size-descending behavior.
- **TestSparkSizeBasedClusteringPlanStrategy:** Added `createFileSliceWithCommitTime(long, String)` and tests: `testEarlierInstantsFirstEnabled`, `testEarlierInstantsFirstDisabled`, `testCommitTimeOrderingWithSameSizes`, `testSortingBehaviorComparisonWithAndWithoutEarlierInstantsFirst`.

No code was copied from other repos; the logic was ported from an internal commit and adapted to the current APIs (e.g. `Pair<Stream<HoodieClusteringGroup>, Boolean>`, `shouldClusteringSingleGroup()`).

### Impact

- **Public API:** New config `hoodie.clustering.earlier_instants_first` and builder method `HoodieClusteringConfig.Builder.withEarlierInstantsFirst(Boolean)`. No breaking changes.
- **User-facing:** Optional behavior; the default `false` keeps the current clustering order (by size only).
- **Performance:** Negligible (one extra comparator key when enabled).
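For reviewers, the sort order described above can be illustrated with a minimal standalone sketch. This is not the code in the PR: `ClusteringSortSketch`, `Slice`, `comparator`, and `sortedIds` are hypothetical names, and the sketch assumes commit times are fixed-width timestamp strings that order correctly under plain string comparison.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class ClusteringSortSketch {

    // Hypothetical stand-in for a file slice: an id plus the two sort keys.
    static final class Slice {
        final String fileId;
        final String commitTime; // assumed fixed-width timestamp, e.g. "20240101000000"
        final long sizeBytes;

        Slice(String fileId, String commitTime, long sizeBytes) {
            this.fileId = fileId;
            this.commitTime = commitTime;
            this.sizeBytes = sizeBytes;
        }
    }

    // Builds the comparator: size descending by default; when the flag is set,
    // commit time ascending first, with size descending as the tie-breaker.
    static Comparator<Slice> comparator(boolean earlierInstantsFirst) {
        Comparator<Slice> bySizeDesc =
            Comparator.comparingLong((Slice s) -> s.sizeBytes).reversed();
        if (!earlierInstantsFirst) {
            return bySizeDesc;
        }
        return Comparator.comparing((Slice s) -> s.commitTime).thenComparing(bySizeDesc);
    }

    // Sorts a copy of the input and returns the file ids in the resulting order.
    static List<String> sortedIds(List<Slice> slices, boolean earlierInstantsFirst) {
        List<Slice> copy = new ArrayList<>(slices);
        copy.sort(comparator(earlierInstantsFirst));
        List<String> ids = new ArrayList<>();
        for (Slice s : copy) {
            ids.add(s.fileId);
        }
        return ids;
    }

    public static void main(String[] args) {
        List<Slice> slices = Arrays.asList(
            new Slice("a", "20240102000000", 300L),
            new Slice("b", "20240101000000", 100L),
            new Slice("c", "20240101000000", 200L));
        System.out.println(sortedIds(slices, false)); // [a, c, b]
        System.out.println(sortedIds(slices, true));  // [c, b, a]
    }
}
```

With the flag off, the largest slice comes first regardless of age; with it on, the two slices from the older commit move ahead of the newer one and are ordered by size among themselves.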
### Risk Level

**Low.** Behavior is off by default. The sorting change is limited to `PartitionAwareClusteringPlanStrategy` and is covered by new and existing unit tests in `TestSparkSizeBasedClusteringPlanStrategy`.

### Documentation Update

- **Config:** Document `hoodie.clustering.earlier_instants_first` in the clustering config section (description and default `false`).
- **Website:** Optional short note under clustering / tuning that this config can be used to prioritize older data when needed (e.g. stitching). No website change is required for merge if docs live in code/config only.

### Contributor's checklist

- [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
- [ ] Enough context is provided in the sections above
- [ ] Adequate tests were added if applicable
