kbuci opened a new pull request, #18302: URL: https://github.com/apache/hudi/pull/18302
### Describe the issue this Pull Request addresses

When using `PreferWriterConflictResolutionStrategy` for multi-writer setups, clustering jobs can fail and leave behind incomplete `replacecommit` instants on the timeline. These stale clustering instants block future writes targeting the same file groups and require manual intervention to clean up.

This PR introduces automatic rollback of failed clustering instants with expired heartbeats, gated behind a new configuration so it is opt-in for users who need it.

Closes #18050

### Summary and Changelog

Adds opt-in support for automatically rolling back failed/stale clustering instants during the `rollbackFailedWrites` flow (LAZY cleaning policy), and a utility for partition-targeted rollback of failed clustering.

**New Configurations:**

- `hoodie.rollback.failed.clustering` (default: `false`): Enables rollback of incomplete clustering instants with expired heartbeats. Automatically inferred as `true` when `PreferWriterConflictResolutionStrategy` is the configured conflict resolution strategy.
- `hoodie.rollback.failed.clustering.wait.minutes` (default: `60`): Minimum age (in minutes) a clustering instant must have before it is eligible for rollback. Acts as a guardrail against rolling back transiently failing clustering operations.

**Behavioral Changes:**

- `HoodieWriteConfig.autoAdjustConfigsForConcurrencyMode`: When `PreferWriterConflictResolutionStrategy` is enabled, the clustering updates strategy is automatically set to `SparkAllowUpdateStrategy` so that ingestion writes can proceed even when there is inflight clustering targeting the same file groups.
- `BaseHoodieTableServiceClient.getInstantsToRollback`: Under the LAZY failed writes cleaning policy, eligible incomplete clustering instants (old enough, config enabled, confirmed as clustering action) are now included in the inflight stream before heartbeat-based expiry filtering.
- `BaseHoodieTableServiceClient.getInstantsToRollbackForLazyCleanPolicy`: The double-check after timeline reload now also considers the pending replace/clustering timeline when the config is enabled, so that expired clustering instants are not inadvertently filtered out.
- New helper `BaseHoodieTableServiceClient.isClusteringInstantEligibleForRollback`: Encapsulates the check for whether an instant is a clustering instant that is old enough and the rollback config is enabled.
- `BaseHoodieTableServiceClient.getPendingRollbackInfos`: Uses the new helper to allow re-attempting pending rollback plans for eligible clustering instants.

**New Utilities in `HoodieClusteringJob`:**

- `getPendingClusteringInstantsForPartitions(metaClient, partitions)`: Returns all pending clustering instant times that target any of the given partitions.
- `rollbackFailedClusteringForPartitions(client, metaClient, partitions)`: Rolls back pending clustering instants targeting the given partitions, filtering for eligibility (config enabled, old enough, clustering action) and expired heartbeat.

**Tests:**

- Unit tests in `TestHoodieWriteConfig` for new config defaults, explicit enable, inference from `PreferWriterConflictResolutionStrategy`, and auto-adjustment of clustering update strategy.
- Unit tests in `TestBaseHoodieTableServiceClient` for `isClusteringInstantEligibleForRollback` and `getInstantsToRollback` behavior with clustering instants under various conditions (config disabled, too recent, eligible, non-clustering, active vs expired heartbeat).
- Integration tests in `TestHoodieClusteringJob` for `getPendingClusteringInstantsForPartitions` and `rollbackFailedClusteringForPartitions` (expired heartbeat triggers rollback, active heartbeat skips rollback).

### Impact

- Two new user-facing configurations: `hoodie.rollback.failed.clustering` and `hoodie.rollback.failed.clustering.wait.minutes`.
- When `PreferWriterConflictResolutionStrategy` is used, `hoodie.clustering.updates.strategy` is now auto-set to `SparkAllowUpdateStrategy`.
- No breaking changes. The new behavior is entirely opt-in (disabled by default). Existing users who do not use `PreferWriterConflictResolutionStrategy` are unaffected.

### Risk Level

Low. The rollback of failed clustering instants is gated behind a config that defaults to `false` and only activates for instants that are old enough (configurable wait time) with expired heartbeats. The auto-adjustment of the clustering update strategy only applies when `PreferWriterConflictResolutionStrategy` is already in use. Unit and integration tests cover the key scenarios.

### Documentation Update

- Config descriptions for `hoodie.rollback.failed.clustering` and `hoodie.rollback.failed.clustering.wait.minutes` are included in the config property definitions as inline documentation.
- The auto-adjustment of `hoodie.clustering.updates.strategy` when using `PreferWriterConflictResolutionStrategy` is logged at INFO level.

### Contributor's checklist

- [x] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
- [x] Enough context is provided in the sections above
- [x] Adequate tests were added if applicable

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
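
As a rough illustration of the eligibility rules described under "Summary and Changelog" (config enabled, confirmed clustering action, old enough, then expired heartbeat), the check might be sketched as below. This is not the code from the PR: `ClusteringRollbackSketch`, `PendingInstant`, `isEligibleForRollback`, and `shouldRollback` are hypothetical names invented for this sketch, and reading the settings through `java.util.Properties` is an assumption. Only the two config keys and their defaults come from the description above.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Properties;

/** Illustrative sketch only; none of these types are Hudi classes. */
class ClusteringRollbackSketch {

    /** Hypothetical stand-in for an incomplete instant on the timeline. */
    static final class PendingInstant {
        final boolean isClustering;      // confirmed as a clustering action
        final Instant requestedAt;       // when the instant was requested
        final boolean heartbeatExpired;  // writer heartbeat is no longer active

        PendingInstant(boolean isClustering, Instant requestedAt, boolean heartbeatExpired) {
            this.isClustering = isClustering;
            this.requestedAt = requestedAt;
            this.heartbeatExpired = heartbeatExpired;
        }
    }

    /** Eligibility: config enabled, clustering action, and at least waitMinutes old. */
    static boolean isEligibleForRollback(PendingInstant instant, boolean rollbackEnabled,
                                         long waitMinutes, Instant now) {
        if (!rollbackEnabled || !instant.isClustering) {
            return false;
        }
        Duration age = Duration.between(instant.requestedAt, now);
        return age.compareTo(Duration.ofMinutes(waitMinutes)) >= 0;
    }

    /** An eligible instant is rolled back only if its heartbeat has also expired. */
    static boolean shouldRollback(PendingInstant instant, boolean rollbackEnabled,
                                  long waitMinutes, Instant now) {
        return isEligibleForRollback(instant, rollbackEnabled, waitMinutes, now)
                && instant.heartbeatExpired;
    }

    public static void main(String[] args) {
        // Keys and defaults are from the PR description; parsing them by hand is an assumption.
        Properties props = new Properties();
        props.setProperty("hoodie.rollback.failed.clustering", "true");
        props.setProperty("hoodie.rollback.failed.clustering.wait.minutes", "60");

        boolean enabled = Boolean.parseBoolean(
                props.getProperty("hoodie.rollback.failed.clustering", "false"));
        long waitMinutes = Long.parseLong(
                props.getProperty("hoodie.rollback.failed.clustering.wait.minutes", "60"));

        Instant now = Instant.now();
        PendingInstant stale = new PendingInstant(true, now.minus(Duration.ofMinutes(90)), true);
        PendingInstant recent = new PendingInstant(true, now.minus(Duration.ofMinutes(5)), true);

        System.out.println("stale instant rolled back: "
                + shouldRollback(stale, enabled, waitMinutes, now));
        System.out.println("recent instant rolled back: "
                + shouldRollback(recent, enabled, waitMinutes, now));
    }
}
```

Keeping the age/config check separate from the heartbeat check mirrors the description, where eligible instants are first added to the inflight stream and only then filtered by heartbeat expiry.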
