kbuci opened a new pull request, #18174:
URL: https://github.com/apache/hudi/pull/18174

   ### Describe the issue this Pull Request addresses
   
   Adds a config that lets the clustering plan strategy sort file slices by 
commit time (earliest first) before file size when building clustering groups. 
This helps use cases (e.g. stitching) that want to cluster older data first to 
reduce lag. The behavior is opt-in; the default remains size-only sorting to 
preserve existing behavior.
   
   ### Summary and Changelog
   
   **Summary:** New config `hoodie.clustering.earlier_instants_first` (default 
`false`). When enabled, `PartitionAwareClusteringPlanStrategy` sorts file 
slices by base file commit time ascending, then by file size descending, so 
older data is clustered first.
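
The ordering described above can be sketched in plain Java. This is a minimal illustration, not the code in the PR: the `Slice` class, its fields, and the method names are hypothetical stand-ins for Hudi's file slice types, and it assumes commit times are fixed-width timestamp strings, so that lexicographic order matches chronological order.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class ClusteringOrderSketch {

    // Hypothetical stand-in for a file slice; only the two sort keys are modeled.
    public static final class Slice {
        public final String commitTime; // assumed fixed-width, e.g. "20240101000000"
        public final long sizeBytes;

        public Slice(String commitTime, long sizeBytes) {
            this.commitTime = commitTime;
            this.sizeBytes = sizeBytes;
        }

        @Override
        public String toString() {
            return commitTime + "/" + sizeBytes;
        }
    }

    // Builds the comparator: size descending only, or commit time ascending
    // with size descending as the tie-breaker when the flag is enabled.
    public static Comparator<Slice> comparator(boolean earlierInstantsFirst) {
        Comparator<Slice> bySizeDesc =
            Comparator.comparingLong((Slice s) -> s.sizeBytes).reversed();
        if (!earlierInstantsFirst) {
            return bySizeDesc;
        }
        return Comparator.comparing((Slice s) -> s.commitTime).thenComparing(bySizeDesc);
    }

    // Returns a sorted copy so the caller's list is left untouched.
    public static List<Slice> sorted(List<Slice> slices, boolean earlierInstantsFirst) {
        List<Slice> copy = new ArrayList<>(slices);
        copy.sort(comparator(earlierInstantsFirst));
        return copy;
    }

    public static void main(String[] args) {
        List<Slice> slices = new ArrayList<>();
        slices.add(new Slice("20240103000000", 300L));
        slices.add(new Slice("20240101000000", 100L));
        slices.add(new Slice("20240101000000", 200L));

        System.out.println("flag off: " + sorted(slices, false));
        System.out.println("flag on:  " + sorted(slices, true));
    }
}
```

With the flag off, the largest slice comes first regardless of age. With the flag on, the two slices from the older commit come first, and size only breaks the tie between them.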
   
   **Changelog:**
   - **HoodieClusteringConfig:** Added `EARLIER_INSTANTS_FIRST` config property 
(default `false`) and `Builder.withEarlierInstantsFirst(Boolean)`.
   - **HoodieWriteConfig:** Added `isEarlierInstantsFirst()`.
   - **PartitionAwareClusteringPlanStrategy:** Replaced size-only sort with a 
configurable comparator: when `isEarlierInstantsFirst()` is true, sort by 
commit time (ascending), then by file size (descending); otherwise keep the 
previous size-descending behavior.
   - **TestSparkSizeBasedClusteringPlanStrategy:** Added 
`createFileSliceWithCommitTime(long, String)` and tests: 
`testEarlierInstantsFirstEnabled`, `testEarlierInstantsFirstDisabled`, 
`testCommitTimeOrderingWithSameSizes`, 
`testSortingBehaviorComparisonWithAndWithoutEarlierInstantsFirst`.
   
   No code was copied from other repos; logic was ported from an internal 
commit and adapted to the current APIs (e.g. 
`Pair<Stream<HoodieClusteringGroup>, Boolean>`, 
`shouldClusteringSingleGroup()`).
   
   ### Impact
   
   - **Public API:** New config `hoodie.clustering.earlier_instants_first` and 
builder method 
`HoodieClusteringConfig.Builder.withEarlierInstantsFirst(Boolean)`. No breaking 
changes.
   - **User-facing:** Optional behavior; default `false` keeps current 
clustering order (by size only).
   - **Performance:** Negligible (one extra comparator key when enabled).
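
The opt-in semantics can be illustrated with `java.util.Properties`. This is only a sketch: the class and method names are hypothetical, and Hudi reads the value through its own config classes (`HoodieClusteringConfig` / `HoodieWriteConfig`) rather than through code like this. Only the key name and the `false` default are taken from this PR.

```java
import java.util.Properties;

public class EarlierInstantsFirstFlag {

    // Key name and default taken from the PR description.
    public static final String KEY = "hoodie.clustering.earlier_instants_first";
    public static final String DEFAULT_VALUE = "false";

    // Hypothetical helper: returns true only when the key is explicitly set to "true".
    public static boolean isEarlierInstantsFirst(Properties props) {
        return Boolean.parseBoolean(props.getProperty(KEY, DEFAULT_VALUE));
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        // Key absent: the default applies, so size-only sorting is kept.
        System.out.println(isEarlierInstantsFirst(props));

        // Explicit opt-in.
        props.setProperty(KEY, "true");
        System.out.println(isEarlierInstantsFirst(props));
    }
}
```

Existing jobs that never set the key keep the size-only ordering, which is why the change is non-breaking.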
   
   ### Risk Level
   
   **Low.** Behavior is off by default. The sorting change is limited to 
`PartitionAwareClusteringPlanStrategy` and is covered by new and existing unit 
tests in `TestSparkSizeBasedClusteringPlanStrategy`.
   
   ### Documentation Update
   
   - **Config:** Document `hoodie.clustering.earlier_instants_first` in the 
clustering config section (description and default `false`).
   - **Website:** Optional short note under clustering / tuning that this 
config can be used to prioritize older data when needed (e.g. stitching). No 
website change required for merge if docs are in code/config only.
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Enough context is provided in the sections above
   - [ ] Adequate tests were added if applicable
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

