kbuci opened a new issue, #17879: URL: https://github.com/apache/hudi/issues/17879
### Task Description

**What needs to be done:**

A new `HoodieWriteConfig` key should be added to indicate that any incomplete clustering plan with an expired heartbeat should be completely rolled back by `rollbackFailedWrites`, with the plan deleted. HUDI will assume that clustering plans are only attempted once (by the same writer client that scheduled the plan). Specifically, HUDI should do the following if and only if this config key is enabled:

- `scheduleClustering` should start a heartbeat for the determined instant time before publishing the plan to the timeline. It should not start the heartbeat when executing the clustering plan in `cluster`.
- `rollbackFailedWrites` should schedule and execute a rollback plan for any incomplete clustering instant with an expired heartbeat.
- Provide a new utility API `rollbackClusteringWithMatchingPartitions` that takes a list of partitions and attempts a rollback of any incomplete clustering plans targeting the same partitions (provided they have an expired heartbeat).

Note that we are making the following assumptions:

- The value of this config key is the same across all writers to a dataset (in a multi-writer setup).
- If two writers attempt to roll back the same instant, there will be no data corruption and no exception thrown.

**Why this task is needed**

For our datasets, we do not perform clustering in the same writer job as ingestion writes. Rather, we orchestrate a separate service that attempts clustering. In addition, we can have concurrent jobs attempt to cluster the same dataset with different types of clustering plans (for example, we may stitch files in newer and older partitions with different strategies).
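To illustrate the selection logic behind `rollbackFailedWrites` and `rollbackClusteringWithMatchingPartitions`, here is a minimal, self-contained Java sketch. The class and field names (`ClusteringPlanInfo`, `lastHeartbeatMs`, the 2-minute timeout) are illustrative assumptions, not existing Hudi APIs; the sketch only models the eligibility check: an incomplete plan is rolled back if its heartbeat has expired and (for the partition-scoped variant) it targets one of the requested partitions.

```java
import java.time.Instant;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical sketch of rollback-eligibility selection for incomplete
// clustering plans. All names here are illustrative, not Hudi APIs.
public class ClusteringRollbackSelector {

    // Minimal stand-in for an incomplete clustering plan on the timeline.
    static class ClusteringPlanInfo {
        final String instantTime;
        final Set<String> targetPartitions;
        final long lastHeartbeatMs; // epoch millis of the last heartbeat refresh

        ClusteringPlanInfo(String instantTime, Set<String> targetPartitions, long lastHeartbeatMs) {
            this.instantTime = instantTime;
            this.targetPartitions = targetPartitions;
            this.lastHeartbeatMs = lastHeartbeatMs;
        }
    }

    // A heartbeat is expired once no refresh has happened within the timeout window.
    static boolean isHeartbeatExpired(ClusteringPlanInfo plan, long nowMs, long timeoutMs) {
        return nowMs - plan.lastHeartbeatMs > timeoutMs;
    }

    // Plans whose heartbeat expired and that target any of the given partitions;
    // the partition filter models rollbackClusteringWithMatchingPartitions.
    static List<ClusteringPlanInfo> selectForRollback(List<ClusteringPlanInfo> incompletePlans,
                                                      Set<String> partitions,
                                                      long nowMs, long timeoutMs) {
        return incompletePlans.stream()
                .filter(p -> isHeartbeatExpired(p, nowMs, timeoutMs))
                .filter(p -> p.targetPartitions.stream().anyMatch(partitions::contains))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        long now = Instant.now().toEpochMilli();
        long timeout = 120_000; // 2-minute heartbeat timeout, illustrative only
        List<ClusteringPlanInfo> plans = List.of(
                new ClusteringPlanInfo("20240101000000", Set.of("2024/01/01"), now - 300_000), // expired
                new ClusteringPlanInfo("20240102000000", Set.of("2024/01/02"), now - 10_000)); // alive
        // Only the first plan qualifies: heartbeat expired and partition matches.
        for (ClusteringPlanInfo p : selectForRollback(plans, Set.of("2024/01/01", "2024/01/02"), now, timeout)) {
            System.out.println("rollback " + p.instantTime);
        }
    }
}
```

A real implementation would then schedule and execute a rollback plan for each selected instant and delete the plan from the timeline; a plan whose heartbeat is still being refreshed is left untouched, which is what makes the scheme safe under the single-attempt assumption above.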
Given this, we had to implement the above requirements in our internal HUDI 0.x build in order to ensure the following:

- If we cannot re-execute an existing clustering plan (because it requires more Spark resources than are currently available), we can still ensure these inflight plans are rolled back without blocking archival, clean, or metadata table services. For us, clustering typically has a lower priority than these other operations.
- When HUDI schedules a clustering plan, it ignores files already targeted by other incomplete clustering plans. This is the desired behavior, but it means that our clustering service job must be able to roll back any existing "leftover" clustering plans targeting the same partition before creating a new clustering plan, since the current job may not have sufficient Spark resources to re-attempt the plan, and executing this "leftover" plan may not be a high priority anyway.

Once we achieve consensus on these requirements, we can start upstreaming our implementations.

### Task Type

Code improvement/refactoring

### Related Issues

**Parent feature issue:** (if applicable)

**Related issues:**

NOTE: Use the `Relationships` button to add parent/blocking issues after the issue is created.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
