nsivabalan opened a new issue, #19074:
URL: https://github.com/apache/hudi/issues/19074

   ### Problem
   
   `INSERT_OVERWRITE` and `INSERT_OVERWRITE_TABLE` can silently replace file 
groups that are in pending clustering. The default `SparkRejectUpdateStrategy` 
is intended to abort writes that overlap a pending clustering plan by throwing 
`HoodieClusteringUpdateException`, but the existing implementation only 
inspects explicit record-level updates produced by the workload. Replace-style 
commits never go through that path: they replace file groups in the targeted 
partitions *wholesale*, which the update strategy never sees.
   
   As a result, an `INSERT_OVERWRITE` against a partition that is in pending 
clustering wins the race against the clustering plan: the overwrite commits 
successfully and the clustering plan is invalidated after the fact (or rolled 
back), instead of the overwrite being aborted up front as the strategy intends.
   
   ### Reproduction
   
   1. Create a COW table with multiple partitions.
   2. Write data to partition `P` so it has at least one file group.
   3. Schedule (but do not execute) clustering on `P`.
   4. Issue an `INSERT_OVERWRITE` whose static or dynamic partition set 
includes `P`.
   5. Observed: the overwrite proceeds and the pending clustering plan is 
invalidated.
   6. Expected: under the default `hoodie.clustering.updates.strategy = 
SparkRejectUpdateStrategy`, the overwrite is rejected with 
`HoodieClusteringUpdateException("Not allowed to update the clustering file 
group ...")`.
   
   ### Scope
   
   - Default `SparkRejectUpdateStrategy` should treat file groups being 
replaced by `INSERT_OVERWRITE`/`INSERT_OVERWRITE_TABLE` the same as explicit 
record-level updates when checking overlap with pending clustering.
   - `SparkAllowUpdateStrategy` and 
`SparkConsistentBucketDuplicateUpdateStrategy` are opt-in, non-default 
strategies and are out of scope for the initial fix.
   - `DELETE_PARTITION` already has a separate pre-existing check 
(`DeletePartitionUtils.checkForPendingTableServiceActions`) and bypasses the 
`clusteringHandleUpdate` pipeline, so it is unaffected.
   
   ### Proposed fix
   
   - Extend `UpdateStrategy` to carry both the pending-clustering file-group 
set and a new "file groups to be replaced" set.
   - Replace-style executors (`SparkInsertOverwriteCommitActionExecutor`, 
`SparkInsertOverwriteTableCommitActionExecutor`) populate the new set: the 
partition-level overwrite computes the file groups in the partitions being 
overwritten (static via `STATIC_OVERWRITE_PARTITION_PATHS`, dynamic via the 
input record partitions); the table-wide overwrite enumerates every file group 
across every partition.
   - `SparkRejectUpdateStrategy` unions explicit updates with the 
to-be-replaced set before the overlap check.
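   
   To make the intended check concrete, here is a minimal, self-contained 
sketch of the logic described in the bullets above. It is not Hudi code: 
`RejectOverlapSketch`, `FileGroupId`, `ClusteringOverlapException`, 
`fileGroupsToReplace`, and `rejectIfOverlapping` are hypothetical stand-ins 
invented for illustration, and the partition-to-file-id map is an assumed 
simplification of however the executors actually enumerate file groups.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Illustrative sketch only; none of these types are Hudi classes. */
public class RejectOverlapSketch {

  /** Hypothetical stand-in for a (partition path, file id) file-group identifier. */
  public record FileGroupId(String partitionPath, String fileId) {}

  /** Hypothetical stand-in for HoodieClusteringUpdateException. */
  public static class ClusteringOverlapException extends RuntimeException {
    public ClusteringOverlapException(String message) {
      super(message);
    }
  }

  /**
   * Collects every existing file group in the partitions being overwritten.
   * For a table-wide overwrite, pass all partitions (fileIdsByPartition.keySet()).
   */
  public static Set<FileGroupId> fileGroupsToReplace(
      Map<String, Set<String>> fileIdsByPartition, Set<String> overwrittenPartitions) {
    Set<FileGroupId> toBeReplaced = new HashSet<>();
    for (String partition : overwrittenPartitions) {
      // A partition with no existing file groups contributes nothing.
      for (String fileId : fileIdsByPartition.getOrDefault(partition, Set.of())) {
        toBeReplaced.add(new FileGroupId(partition, fileId));
      }
    }
    return toBeReplaced;
  }

  /**
   * Rejects the write if any file group it touches, through a record-level
   * update or a wholesale replacement, belongs to a pending clustering plan.
   */
  public static void rejectIfOverlapping(
      Set<FileGroupId> pendingClustering,
      Set<FileGroupId> explicitlyUpdated,
      Set<FileGroupId> toBeReplaced) {
    // Union both sources of touched file groups before the overlap check.
    Set<FileGroupId> touched = new HashSet<>(explicitlyUpdated);
    touched.addAll(toBeReplaced);

    for (FileGroupId fileGroup : touched) {
      if (pendingClustering.contains(fileGroup)) {
        throw new ClusteringOverlapException(
            "Not allowed to update the clustering file group " + fileGroup);
      }
    }
  }
}
```

   In this sketch, an overwrite of partition `P` while a file group in `P` is 
in pending clustering ends up in `toBeReplaced` and is rejected, even when 
`explicitlyUpdated` is empty. That is the case the current record-level-only 
check misses.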
   
   ### Affected versions
   
   Master (1.x). Tracked PR: #18829.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.