kbuci opened a new pull request, #18456: URL: https://github.com/apache/hudi/pull/18456
### Describe the issue this Pull Request addresses `FutureUtils.allOf()` has a race condition that causes the original root-cause exception to be silently replaced by a `CancellationException`, making it impossible to diagnose failures in any code path that uses it — most notably [`MultipleSparkJobExecutionStrategy.performClustering()`](https://github.com/apache/hudi/blob/master/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/clustering/run/strategy/MultipleSparkJobExecutionStrategy.java#L111-L128), which executes clustering groups in parallel using `FutureUtils.allOf()`. **The bug:** In the [`whenComplete` callback](https://github.com/apache/hudi/blob/master/hudi-common/src/main/java/org/apache/hudi/common/util/FutureUtils.java#L50-L55), when a future fails, [`cancel(true)`](https://github.com/apache/hudi/blob/master/hudi-common/src/main/java/org/apache/hudi/common/util/FutureUtils.java#L52) is called on all other futures *before* [`union.completeExceptionally(throwable)`](https://github.com/apache/hudi/blob/master/hudi-common/src/main/java/org/apache/hudi/common/util/FutureUtils.java#L53). Since [`CompletableFuture.cancel()`](https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/util/concurrent/CompletableFuture.html#cancel(boolean)) is equivalent to `completeExceptionally(new CancellationException())`, it synchronously completes the cancelled futures and triggers their `whenComplete` callbacks. Those callbacks in turn call `union.completeExceptionally(CancellationException)`. Because [`completeExceptionally`](https://docs.oracle.com/ en/java/javase/11/docs/api/java.base/java/util/concurrent/CompletableFuture.html#completeExceptionally(java.lang.Throwable)) can only succeed once, the `CancellationException` wins and the original exception is permanently lost. Additionally, the internal [`BiRelay`](https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/util/concurrent/CompletableFuture.html#allOf(java.util.concurrent.CompletableFuture...)) tree inside [`CompletableFuture.allOf()`](https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/util/concurrent/CompletableFuture.html#allOf(java.util.concurrent.CompletableFuture...)) also detects the cancelled futures and may independently complete `union` with `CancellationException`, creating a three-way race. **Example:** Suppose clustering has 3 groups running in parallel. Group 1 fails with `IOException` (e.g., transient DFS error). The `whenComplete` callback for Group 1 fires and calls `cancel()` on Groups 2 and 3. Group 2's cancellation triggers its own `whenComplete` callback, which calls `union.completeExceptionally(CancellationException)` — this succeeds. When Group 1's callback then calls `union.completeExceptionally(IOException)`, it's a no-op. The caller at `.join()` sees only `CompletionException: CancellationException` with no indication of the real failure. ### Summary and Changelog - Atomically capture the first failure using an `AtomicReference<Throwable>` via `compareAndSet`, which is immune to callback ordering - Replace `union.thenApply()` with `union.handle()` to intercept both success and failure paths, substituting the real first failure from the `AtomicReference` over whatever exception the `allOf` BiRelay propagated - Remove the now-unnecessary `union.completeExceptionally()` call - Added `TestFutureUtils` with tests for: all-succeed, single-failure-preserves-exception-and-cancels-others, and concurrent failure scenario ### Impact Any code path using `FutureUtils.allOf()` will now correctly propagate the original root-cause exception instead of a misleading `CancellationException`. The primary user-facing impact is in clustering execution (`MultipleSparkJobExecutionStrategy`), where failures will now produce actionable error messages instead of opaque `CancellationException`s. No public API changes. ### Risk Level low — The change is limited to `FutureUtils.allOf()` and preserves the same behavioral contract (fail-fast with cancellation of remaining futures). The only difference is which exception is propagated. Unit tests are added to verify correctness. ### Documentation Update none ### Contributor's checklist - [x] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute) - [x] Enough context is provided in the sections above - [x] Adequate tests were added if applicable -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
