kbuci opened a new pull request, #18302:
URL: https://github.com/apache/hudi/pull/18302

   ### Describe the issue this Pull Request addresses
   
   When using `PreferWriterConflictResolutionStrategy` for multi-writer setups, 
clustering jobs can fail and leave behind incomplete `replacecommit` instants 
on the timeline. These stale clustering instants block future writes targeting 
the same file groups and require manual intervention to clean up. This PR 
introduces automatic rollback of failed clustering instants with expired 
heartbeats, gated behind a new configuration so it is opt-in for users who need 
it.
   
   Closes #18050
   
   ### Summary and Changelog
   
   Adds opt-in support for automatically rolling back failed/stale clustering 
instants during the `rollbackFailedWrites` flow (LAZY cleaning policy), and a 
utility for partition-targeted rollback of failed clustering.
   
   **New Configurations:**
   - `hoodie.rollback.failed.clustering` (default: `false`): Enables rollback 
of incomplete clustering instants with expired heartbeats. Automatically 
inferred as `true` when `PreferWriterConflictResolutionStrategy` is the 
configured conflict resolution strategy.
   - `hoodie.rollback.failed.clustering.wait.minutes` (default: `60`): Minimum 
age (in minutes) a clustering instant must have before it is eligible for 
rollback. Acts as a guardrail against rolling back transiently failing 
clustering operations.
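
As a rough illustration of how a writer job might opt in, the sketch below builds the two properties with plain `java.util.Properties`. Only the two property keys come from this PR; the class name `FailedClusteringRollbackProps`, the `build` helper, and the 120-minute value are hypothetical, and how the properties reach the write config is left out.

```java
import java.util.Properties;

// Hypothetical helper for illustration only; it is not part of Hudi.
// The two property keys are the ones introduced by this PR.
final class FailedClusteringRollbackProps {

    static Properties build(boolean enabled, long waitMinutes) {
        Properties props = new Properties();
        // Opt in to rolling back failed clustering instants with expired heartbeats.
        props.setProperty("hoodie.rollback.failed.clustering", Boolean.toString(enabled));
        // Minimum age, in minutes, before a clustering instant may be rolled back.
        props.setProperty("hoodie.rollback.failed.clustering.wait.minutes", Long.toString(waitMinutes));
        return props;
    }

    public static void main(String[] args) {
        // Example: enable the feature with a longer guardrail than the default of 60.
        build(true, 120).forEach((key, value) -> System.out.println(key + "=" + value));
    }
}
```

Raising the wait time trades slower cleanup for a lower chance of rolling back a clustering run that is only transiently failing.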
   
   **Behavioral Changes:**
   - `HoodieWriteConfig.autoAdjustConfigsForConcurrencyMode`: When 
`PreferWriterConflictResolutionStrategy` is enabled, the clustering updates 
strategy is automatically set to `SparkAllowUpdateStrategy` so that ingestion 
writes can proceed even when there is inflight clustering targeting the same 
file groups.
   - `BaseHoodieTableServiceClient.getInstantsToRollback`: Under the LAZY 
failed writes cleaning policy, eligible incomplete clustering instants (old 
enough, config enabled, confirmed as clustering action) are now included in the 
inflight stream before heartbeat-based expiry filtering.
   - `BaseHoodieTableServiceClient.getInstantsToRollbackForLazyCleanPolicy`: 
The double-check after timeline reload now also considers the pending 
replace/clustering timeline when the config is enabled, so that expired 
clustering instants are not inadvertently filtered out.
   - New helper 
`BaseHoodieTableServiceClient.isClusteringInstantEligibleForRollback`: 
Encapsulates the check for whether an instant is a clustering instant that is 
old enough and the rollback config is enabled.
   - `BaseHoodieTableServiceClient.getPendingRollbackInfos`: Uses the new 
helper to allow re-attempting pending rollback plans for eligible clustering 
instants.
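
The ordering described above (eligibility first, heartbeat expiry second) can be sketched in plain Java. This is a simplified model, not the code in `BaseHoodieTableServiceClient`: the `PendingInstant` type, its fields, and the method signatures are hypothetical, and treating "old enough" as age greater than or equal to the wait time is an assumption.

```java
import java.time.Duration;
import java.time.Instant;

// Simplified, hypothetical model of the checks described in the PR text.
final class ClusteringRollbackEligibility {

    // Stand-in for a pending timeline instant; not a Hudi class.
    static final class PendingInstant {
        final boolean isClustering;
        final Instant requestedAt;
        final boolean heartbeatExpired;

        PendingInstant(boolean isClustering, Instant requestedAt, boolean heartbeatExpired) {
            this.isClustering = isClustering;
            this.requestedAt = requestedAt;
            this.heartbeatExpired = heartbeatExpired;
        }
    }

    // Eligibility: config enabled, confirmed clustering action, and old enough.
    // Assumption: an age exactly equal to the wait time counts as old enough.
    static boolean isEligible(boolean rollbackEnabled, long waitMinutes,
                              PendingInstant instant, Instant now) {
        if (!rollbackEnabled || !instant.isClustering) {
            return false;
        }
        Duration age = Duration.between(instant.requestedAt, now);
        return age.compareTo(Duration.ofMinutes(waitMinutes)) >= 0;
    }

    // Heartbeat filtering is applied only after the eligibility check passes.
    static boolean shouldRollback(boolean rollbackEnabled, long waitMinutes,
                                  PendingInstant instant, Instant now) {
        return isEligible(rollbackEnabled, waitMinutes, instant, now) && instant.heartbeatExpired;
    }

    public static void main(String[] args) {
        Instant now = Instant.now();
        PendingInstant stale = new PendingInstant(true, now.minus(Duration.ofMinutes(90)), true);
        System.out.println("stale clustering rolled back: " + shouldRollback(true, 60, stale, now));
    }
}
```

Keeping the age check separate from the heartbeat check means a clustering job that has only just lost its heartbeat is not rolled back until the wait time has passed.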
   
   **New Utilities in `HoodieClusteringJob`:**
   - `getPendingClusteringInstantsForPartitions(metaClient, partitions)`: 
Returns all pending clustering instant times that target any of the given 
partitions.
   - `rollbackFailedClusteringForPartitions(client, metaClient, partitions)`: 
Rolls back pending clustering instants targeting the given partitions, 
filtering for eligibility (config enabled, old enough, clustering action) and 
expired heartbeat.
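
The partition-targeting step can be pictured as a set-intersection filter. The sketch below is a standalone approximation, not the `HoodieClusteringJob` implementation: it assumes the partitions touched by each pending clustering instant are already available as a map, and the class name, method name, and sample instant times are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical standalone sketch; the real utility reads clustering plans from the timeline.
final class PartitionTargetedClustering {

    // Returns the instant times whose clustering plan touches any of the target partitions.
    static List<String> pendingInstantsForPartitions(Map<String, Set<String>> partitionsByInstant,
                                                     Collection<String> targetPartitions) {
        List<String> matches = new ArrayList<>();
        for (Map.Entry<String, Set<String>> entry : partitionsByInstant.entrySet()) {
            // disjoint() is false when at least one partition is shared.
            if (!Collections.disjoint(entry.getValue(), targetPartitions)) {
                matches.add(entry.getKey());
            }
        }
        // Sort so callers see instants in a stable, lexicographic order.
        Collections.sort(matches);
        return matches;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> pending = new HashMap<>();
        pending.put("20240101120000000", new HashSet<>(Arrays.asList("2024/01/01", "2024/01/02")));
        pending.put("20240102120000000", new HashSet<>(Arrays.asList("2024/01/03")));
        System.out.println(pendingInstantsForPartitions(pending, Arrays.asList("2024/01/02")));
    }
}
```

The instants returned by such a filter would still need to pass the eligibility and expired-heartbeat checks before being rolled back.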
   
   **Tests:**
   - Unit tests in `TestHoodieWriteConfig` for new config defaults, explicit 
enable, inference from `PreferWriterConflictResolutionStrategy`, and 
auto-adjustment of clustering update strategy.
   - Unit tests in `TestBaseHoodieTableServiceClient` for 
`isClusteringInstantEligibleForRollback` and `getInstantsToRollback` behavior 
with clustering instants under various conditions (config disabled, too recent, 
eligible, non-clustering, active vs expired heartbeat).
   - Integration tests in `TestHoodieClusteringJob` for 
`getPendingClusteringInstantsForPartitions` and 
`rollbackFailedClusteringForPartitions` (expired heartbeat triggers rollback, 
active heartbeat skips rollback).
   
   ### Impact
   
   - Two new user-facing configurations: `hoodie.rollback.failed.clustering` 
and `hoodie.rollback.failed.clustering.wait.minutes`.
   - When `PreferWriterConflictResolutionStrategy` is used, 
`hoodie.clustering.updates.strategy` is now auto-set to 
`SparkAllowUpdateStrategy`.
   - No breaking changes. Rollback of failed clustering is disabled by 
default and is enabled only when set explicitly or inferred from 
`PreferWriterConflictResolutionStrategy`. Existing users who do not use 
`PreferWriterConflictResolutionStrategy` are unaffected.
   
   ### Risk Level
   
   Low. The rollback of failed clustering instants is gated behind a config 
that defaults to `false` and only activates for instants that are old enough 
(configurable wait time) with expired heartbeats. The auto-adjustment of the 
clustering update strategy only applies when 
`PreferWriterConflictResolutionStrategy` is already in use. Unit and 
integration tests cover the key scenarios.
   
   ### Documentation Update
   
   - Config descriptions for `hoodie.rollback.failed.clustering` and 
`hoodie.rollback.failed.clustering.wait.minutes` are included in the config 
property definitions as inline documentation.
   - The auto-adjustment of `hoodie.clustering.updates.strategy` when using 
`PreferWriterConflictResolutionStrategy` is logged at INFO level.
   
   ### Contributor's checklist
   
   - [x] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [x] Enough context is provided in the sections above
   - [x] Adequate tests were added if applicable
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
