nsivabalan opened a new issue, #19074:
URL: https://github.com/apache/hudi/issues/19074
### Problem
`INSERT_OVERWRITE` and `INSERT_OVERWRITE_TABLE` can silently replace file
groups that are in pending clustering. The default `SparkRejectUpdateStrategy`
is intended to abort writes that overlap with a clustering plan by throwing
`HoodieClusteringUpdateException`, but the existing implementation only
inspects explicit record-level updates produced by the workload. Replace-style
commits never go through that path: they replace file groups in the targeted
partitions *wholesale*, which the update strategy never sees.
As a result, an `INSERT_OVERWRITE` against a partition that is in pending
clustering wins the race against the clustering plan: the overwrite commits
successfully and the clustering plan is invalidated after the fact (or rolled
back), instead of the overwrite being aborted up front as the strategy intends.
### Reproduction
1. Create a COW table with multiple partitions.
2. Write data to partition `P` so it has at least one file group.
3. Schedule (but do not execute) clustering on `P`.
4. Issue an `INSERT_OVERWRITE` whose static or dynamic partition set
includes `P`.
5. Observed: the overwrite proceeds and the pending clustering plan is
invalidated.
6. Expected: under the default
`hoodie.clustering.updates.strategy = SparkRejectUpdateStrategy`, the
overwrite is rejected with
`HoodieClusteringUpdateException("Not allowed to update the clustering file group ...")`.
### Scope
- Default `SparkRejectUpdateStrategy` should treat file groups being
replaced by `INSERT_OVERWRITE`/`INSERT_OVERWRITE_TABLE` the same as explicit
record-level updates when checking overlap with pending clustering.
- `SparkAllowUpdateStrategy` and
`SparkConsistentBucketDuplicateUpdateStrategy` are opt-in, non-default
strategies and are out of scope for the initial fix.
- `DELETE_PARTITION` already has a separate pre-existing check
(`DeletePartitionUtils.checkForPendingTableServiceActions`) and bypasses the
`clusteringHandleUpdate` pipeline, so it is unaffected.
### Proposed fix
- Extend `UpdateStrategy` to carry both the pending-clustering file-group
set and a new "file groups to be replaced" set.
- Replace-style executors (`SparkInsertOverwriteCommitActionExecutor`,
`SparkInsertOverwriteTableCommitActionExecutor`) populate the new set.
`INSERT_OVERWRITE` computes the file groups in the partitions being
overwritten (static via `STATIC_OVERWRITE_PARTITION_PATHS`, dynamic via the
input record partitions); `INSERT_OVERWRITE_TABLE` enumerates every file group
across every partition.
- `SparkRejectUpdateStrategy` unions explicit updates with the
to-be-replaced set before the overlap check.
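
The overlap check proposed above can be illustrated with a small standalone
model. This is only a sketch and does not use Hudi's classes: `FileGroupId`,
`ClusteringConflictException`, and `rejectIfOverlapping` are hypothetical
stand-ins for the file-group identifier, `HoodieClusteringUpdateException`,
and the strategy's check, and the real signatures in the fix may differ.

```java
import java.util.HashSet;
import java.util.Objects;
import java.util.Set;

public class RejectOverlapSketch {

  /** Hypothetical stand-in for a file group identifier (partition path + file id). */
  static final class FileGroupId {
    final String partitionPath;
    final String fileId;

    FileGroupId(String partitionPath, String fileId) {
      this.partitionPath = partitionPath;
      this.fileId = fileId;
    }

    @Override
    public boolean equals(Object o) {
      if (this == o) {
        return true;
      }
      if (!(o instanceof FileGroupId)) {
        return false;
      }
      FileGroupId other = (FileGroupId) o;
      return partitionPath.equals(other.partitionPath) && fileId.equals(other.fileId);
    }

    @Override
    public int hashCode() {
      return Objects.hash(partitionPath, fileId);
    }

    @Override
    public String toString() {
      return partitionPath + "/" + fileId;
    }
  }

  /** Hypothetical stand-in for HoodieClusteringUpdateException. */
  static final class ClusteringConflictException extends RuntimeException {
    ClusteringConflictException(String message) {
      super(message);
    }
  }

  /**
   * Rejects the write if any file group it touches is in pending clustering.
   * "Touched" is the union of explicit record-level updates and the file
   * groups a replace-style commit would replace wholesale.
   */
  static void rejectIfOverlapping(Set<FileGroupId> pendingClustering,
                                  Set<FileGroupId> explicitlyUpdated,
                                  Set<FileGroupId> toBeReplaced) {
    Set<FileGroupId> touched = new HashSet<>(explicitlyUpdated);
    touched.addAll(toBeReplaced);
    for (FileGroupId fileGroup : touched) {
      if (pendingClustering.contains(fileGroup)) {
        throw new ClusteringConflictException(
            "Not allowed to update the clustering file group " + fileGroup);
      }
    }
  }

  public static void main(String[] args) {
    // Partition P has one file group with clustering scheduled but not executed.
    Set<FileGroupId> pending = new HashSet<>();
    pending.add(new FileGroupId("P", "fg-1"));

    // An overwrite of P carries no explicit updates, only replaced file groups.
    Set<FileGroupId> updates = new HashSet<>();
    Set<FileGroupId> replaced = new HashSet<>();
    replaced.add(new FileGroupId("P", "fg-1"));

    try {
      rejectIfOverlapping(pending, updates, replaced);
      System.out.println("overwrite allowed");
    } catch (ClusteringConflictException e) {
      System.out.println("overwrite rejected: " + e.getMessage());
    }
  }
}
```

In this model, passing an empty `toBeReplaced` set corresponds to the current
behavior described in the Problem section: the overwrite of `P` has no explicit
updates, so the check finds no overlap and the write proceeds.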
### Affected versions
Master (1.x). Tracked PR: #18829.