nsivabalan opened a new pull request, #18829:
URL: https://github.com/apache/hudi/pull/18829

   ### Change Logs
   
   - Fixes a correctness bug where `INSERT_OVERWRITE` and 
`INSERT_OVERWRITE_TABLE` could replace file groups that were in pending 
clustering, because the clustering update strategies only inspected explicit 
record-level updates and never saw the file groups being wholesale replaced.
   - Extends `UpdateStrategy` with a `fileGroupsToBeReplaced` set populated by 
the replace-style commit executors.
   - `SparkRejectUpdateStrategy` (the default) now unions explicit updates with 
the to-be-replaced set before checking against pending clustering, and throws 
`HoodieClusteringUpdateException` if they overlap.
   
   ### Implementation
   
   - `UpdateStrategy` / `BaseSparkUpdateStrategy` carry both the 
pending-clustering set and the to-be-replaced set.
   - `BaseSparkCommitActionExecutor` exposes `getFileGroupsBeingReplaced()` 
(empty by default) and passes both sets when instantiating the update strategy.
   - `SparkInsertOverwriteCommitActionExecutor` overrides the hook to compute 
the file groups in the partitions being overwritten (static via 
`STATIC_OVERWRITE_PARTITION_PATHS`, dynamic via the input record partitions).
   - `SparkInsertOverwriteTableCommitActionExecutor` overrides the hook to 
enumerate every file group across every partition, matching the table-wide 
replace semantics of its `getPartitionToReplacedFileIds`.
   - Flink `ConsistentBucketUpdateStrategy` / 
`FlinkConsistentBucketUpdateStrategy` pass an empty `fileGroupsToBeReplaced` to 
the new 4-arg `super` constructor so the Flink build keeps compiling.
   
   ### What is NOT covered (intentionally, in this PR)
   
   - `SparkAllowUpdateStrategy` and 
`SparkConsistentBucketDuplicateUpdateStrategy` accept the new parameter today 
but do not consult it; tightening those is left to a follow-up. These are 
opt-in non-default strategies.
   - `DELETE_PARTITION` already has a separate pre-existing check 
(`DeletePartitionUtils.checkForPendingTableServiceActions`) and bypasses the 
`clusteringHandleUpdate` pipeline, so it is unaffected by this change.
   
   ### Impact
   
   `INSERT_OVERWRITE` / `INSERT_OVERWRITE_TABLE` against a partition that has 
pending clustering will now correctly abort with 
`HoodieClusteringUpdateException` ("Not allowed to update the clustering file 
group ...") under the default `SparkRejectUpdateStrategy`, instead of silently 
winning the race against the clustering plan.
   
   ### Risk level
   
   low. Default-strategy behavior strictly tightens detection of a known race. 
Allow/bucket strategies and Flink subclasses are pass-through.
   
   ### Test Coverage
   
   `TestInsertOverwriteWithClustering` (new) covers, all on COPY_ON_WRITE:
   - static & dynamic `INSERT_OVERWRITE` against pending clustering on 
overlapping partitions — must reject
   - `INSERT_OVERWRITE` against a non-overlapping partition — must succeed 
without rolling back clustering
   - multi-step scenarios with selective overlap (overwrite one partition, then 
a second that overlaps clustering)
   - `DELETE_PARTITION` against a non-overlapping partition — must succeed 
without rolling back clustering
   - direct validation of `getFileGroupsBeingReplaced` and 
`getPartitionToReplacedFileIds` for static, dynamic, multi-file-group, and 
empty-partition cases
   
   12 tests, all passing locally against current master (`mvn -pl 
hudi-client/hudi-spark-client -Pspark3.5 
-Dtest=TestInsertOverwriteWithClustering test`).
   
   ### Documentation Update
   
   No new configs or user-facing config-key changes. Existing 
`hoodie.clustering.updates.strategy` default behavior is tightened only — users 
on the default see a clearer error in a previously-buggy scenario.
   
   ### Contributor's checklist
   
   - [x] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [x] Change Logs and Impact were stated clearly
   - [x] Adequate tests were added
   - [x] CI passed


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to