[PR] fix(clustering): retain missing partitions in selected/regex incremental scheduling [hudi]

via GitHub Tue, 09 Jun 2026 01:46:39 -0700


fhan688 opened a new pull request, #18945:
URL: https://github.com/apache/hudi/pull/18945


   ### Describe the issue this Pull Request addresses
   
     Incremental clustering builds its scheduling window from partitions 
changed by new commits plus `missingSchedulePartitions` recorded in the last 
clustering plan.
   
     When clustering only schedules part of that current window, the unscheduled partitions need to be written back to `missingSchedulePartitions` so that later schedules can pick them up. Today, partitions filtered out by the clustering partition regex are dropped from the current window without being recorded as missing. Similarly, when users temporarily set `hoodie.clustering.plan.strategy.target.partitions`, the strategy reads the previous plan's missing partitions instead of deriving missing partitions from the current scheduling window.
   
     As a result, incremental clustering can skip these partitions permanently if they never receive new writes.
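
To make the failure mode concrete, here is a minimal sketch of two scheduling rounds. It is a hypothetical model, not Hudi code: the class `SkippedPartitionScenario`, its methods, and the `dt=NN` partition names are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Pattern;

// Hypothetical model of incremental scheduling; none of these names are Hudi APIs.
class SkippedPartitionScenario {

  // Window for one round: partitions changed by new commits,
  // followed by the previous plan's missing partitions (duplicates dropped).
  static List<String> window(List<String> changed, List<String> previousMissing) {
    Set<String> merged = new LinkedHashSet<>(changed);
    merged.addAll(previousMissing);
    return new ArrayList<>(merged);
  }

  // Partitions of the window that the partition regex lets through.
  static List<String> scheduled(List<String> window, String regex) {
    Pattern pattern = Pattern.compile(regex);
    List<String> matched = new ArrayList<>();
    for (String partition : window) {
      if (pattern.matcher(partition).matches()) {
        matched.add(partition);
      }
    }
    return matched;
  }

  // Behavior described in this PR: keep whatever the window held
  // but the plan did not schedule.
  static List<String> retainedMissing(List<String> window, List<String> scheduled) {
    List<String> missing = new ArrayList<>(window);
    missing.removeAll(scheduled);
    return missing;
  }
}
```

With a round-one window of `dt=01`, `dt=02`, `dt=03` and the regex `dt=01`, only `dt=01` is scheduled. If the other two are recorded as missing, a second round with no new writes still has `dt=02` and `dt=03` in its window. If they are dropped, that second window is empty and the two partitions are never clustered.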
   
     ### Summary and Changelog
   
     This change preserves unscheduled partitions from the current incremental 
clustering window.
   
     Changes:
     - Added common helpers in `ClusteringPlanStrategy` to compute missing 
partitions from the current scheduling window.
     - Updated `PartitionAwareClusteringPlanStrategy` to record partitions 
filtered out by the regex partition filter as `missingSchedulePartitions`.
     - Updated `PartitionAwareClusteringPlanStrategy` to compute missing 
partitions for manually selected partitions as `currentWindow - 
selectedPartitions`.
     - Used insertion-order-preserving de-duplication so that the missing-partition output is deterministic.
     - Added unit coverage for missing partition calculation with incremental 
table service enabled and disabled.
     - Updated incremental clustering regression coverage to verify 
regex-filtered partitions are retained as missing.
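
The `currentWindow - selectedPartitions` calculation and the order-preserving de-duplication listed above could look roughly like the sketch below. The class and method names are hypothetical and do not reproduce the helpers added to `ClusteringPlanStrategy`. Returning an empty list when incremental table service is disabled is also an assumption.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper; the PR's real helpers may differ in name and signature.
class MissingPartitionsSketch {

  static List<String> computeMissing(boolean incrementalEnabled,
                                     Collection<String> currentWindow,
                                     Collection<String> selectedPartitions) {
    if (!incrementalEnabled) {
      // Assumption: with incremental scheduling off, nothing is carried forward.
      return Collections.emptyList();
    }
    Set<String> selected = new HashSet<>(selectedPartitions);
    // LinkedHashSet removes duplicates while keeping first-seen order,
    // so the same inputs always produce the same output order.
    Set<String> missing = new LinkedHashSet<>();
    for (String partition : currentWindow) {
      if (!selected.contains(partition)) {
        missing.add(partition);
      }
    }
    return new ArrayList<>(missing);
  }
}
```

For a window of `c, a, c, b` with `a` selected, this sketch returns `c, b`: the duplicate `c` is collapsed and the window's original order is kept.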
   
     No code was copied.
   
     ### Impact
   
     No public API, storage format, or config changes.
   
     This changes incremental clustering scheduling behavior for partition-filtered plans: partitions in the current incremental scheduling window that are not selected by regex or manual target partition selection are retained in `missingSchedulePartitions` for future scheduling instead of being dropped.
   
     No performance impact is expected beyond small in-memory set operations 
over the current scheduling window.
   
     ### Risk Level
   
     low
   
     The change is limited to clustering plan generation metadata. It does not 
change executor incremental window calculation or clustering execution.
   
     Verification:
     - `git diff --check`
     - `mvn -pl hudi-client/hudi-client-common -am -DskipITs -Dcheckstyle.skip=true -DfailIfNoTests=false -Dsurefire.failIfNoSpecifiedTests=false -Dtest=TestPartitionAwareClusteringPlanStrategy test`
   
     ### Documentation Update
   
     none
   
     No new config, API, or user-facing feature is added.
   
     ### Contributor's checklist
   
     - [x] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
     - [x] Enough context is provided in the sections above
     - [x] Adequate tests were added if applicable


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

