fhan688 opened a new pull request, #18945:
URL: https://github.com/apache/hudi/pull/18945
### Describe the issue this Pull Request addresses
Incremental clustering builds its scheduling window from partitions
changed by new commits plus `missingSchedulePartitions` recorded in the last
clustering plan.
When clustering only schedules part of that current window, the
unscheduled partitions need to be written back to `missingSchedulePartitions`
so that later schedules can pick them up. Today, partitions filtered out by the
clustering partition regex are dropped from the current window without
being recorded as missing. Similarly, when users temporarily set
`hoodie.clustering.plan.strategy.target.partitions`, the strategy reads the
previous plan's missing partitions instead of deriving missing partitions from
the current scheduling window.
This can cause partitions to be skipped permanently by incremental
clustering if they do not receive new writes later.
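To make the failure mode concrete, here is a minimal, self-contained sketch of the window construction and the regex filter described above. It is not Hudi code: `buildWindow`, `filterByRegex`, the partition paths, and the full-match regex semantics are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Pattern;

class IncrementalWindowSketch {

  // Hypothetical helper: the scheduling window is the union of partitions
  // touched by new commits and the partitions the previous plan left behind.
  static Set<String> buildWindow(List<String> changedPartitions, List<String> previousMissing) {
    Set<String> window = new LinkedHashSet<>(changedPartitions);
    window.addAll(previousMissing);
    return window;
  }

  // Hypothetical stand-in for the regex partition filter; full-match
  // semantics are an assumption of this sketch.
  static List<String> filterByRegex(Set<String> window, String regex) {
    Pattern pattern = Pattern.compile(regex);
    List<String> selected = new ArrayList<>();
    for (String partition : window) {
      if (pattern.matcher(partition).matches()) {
        selected.add(partition);
      }
    }
    return selected;
  }

  public static void main(String[] args) {
    Set<String> window = buildWindow(
        Arrays.asList("2024/01/01", "2024/01/02"),
        Arrays.asList("2023/12/31"));
    List<String> selected = filterByRegex(window, "2024/.*");
    System.out.println("window   = " + window);
    System.out.println("selected = " + selected);
    // "2023/12/31" is in the window but not selected. If the new plan does not
    // record it as missing, nothing brings it back unless it receives new writes.
  }
}
```

In this sketch the partition carried over from the previous plan falls out of the selection, which is the case that needs to be written back to `missingSchedulePartitions`.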
### Summary and Changelog
This change preserves unscheduled partitions from the current incremental
clustering window.
Changes:
- Added common helpers in `ClusteringPlanStrategy` to compute missing
partitions from the current scheduling window.
- Updated `PartitionAwareClusteringPlanStrategy` to record partitions
filtered out by the regex partition filter as `missingSchedulePartitions`.
- Updated `PartitionAwareClusteringPlanStrategy` to compute missing
partitions for manually selected partitions as
`currentWindow - selectedPartitions`.
- Used insertion-order-preserving de-duplication so that the missing
partition output is deterministic.
- Added unit coverage for missing partition calculation with incremental
table service enabled and disabled.
- Updated incremental clustering regression coverage to verify
regex-filtered partitions are retained as missing.
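The set difference and ordering behavior listed above can be sketched as follows. This is an illustrative approximation, not the helper added to `ClusteringPlanStrategy`: the class name, the method name `computeMissingPartitions`, and its signature are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

class MissingPartitionsSketch {

  // Hypothetical helper (not the PR's actual signature):
  // missing = currentWindow - scheduledPartitions.
  // LinkedHashSet drops duplicates while keeping first-seen order, so the
  // result is the same on every run for the same input.
  static List<String> computeMissingPartitions(Collection<String> currentWindow,
                                               Collection<String> scheduledPartitions) {
    Set<String> missing = new LinkedHashSet<>(currentWindow);
    missing.removeAll(scheduledPartitions);
    return new ArrayList<>(missing);
  }

  public static void main(String[] args) {
    List<String> window = Arrays.asList("p3", "p1", "p3", "p2", "p1");
    List<String> scheduled = Arrays.asList("p1");
    System.out.println(computeMissingPartitions(window, scheduled)); // prints [p3, p2]
  }
}
```

A `HashSet` would give the same elements but an unspecified iteration order, which makes plan metadata harder to compare in tests; that is presumably the motivation for the ordered variant.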
No code was copied.
### Impact
No public API, storage format, or config changes.
This changes incremental clustering scheduling behavior for
partition-filtered plans: partitions in the current incremental scheduling
window that are not selected by regex or manual target partition selection are
retained in `missingSchedulePartitions` for future scheduling instead of being
dropped.
No performance impact is expected beyond small in-memory set operations
over the current scheduling window.
### Risk Level
low
The change is limited to clustering plan generation metadata. It does not
change executor incremental window calculation or clustering execution.
Verification:
- `git diff --check`
- `mvn -pl hudi-client/hudi-client-common -am -DskipITs -Dcheckstyle.skip=true -DfailIfNoTests=false -Dsurefire.failIfNoSpecifiedTests=false -Dtest=TestPartitionAwareClusteringPlanStrategy test`
### Documentation Update
none
No new config, API, or user-facing feature is added.
### Contributor's checklist
- [x] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
- [x] Enough context is provided in the sections above
- [x] Adequate tests were added if applicable