kbuci opened a new pull request, #18456:
URL: https://github.com/apache/hudi/pull/18456

   ### Describe the issue this Pull Request addresses
   
   `FutureUtils.allOf()` has a race condition that causes the original 
root-cause exception to be silently replaced by a `CancellationException`, 
making it impossible to diagnose failures in any code path that uses it — most 
notably 
[`MultipleSparkJobExecutionStrategy.performClustering()`](https://github.com/apache/hudi/blob/master/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/clustering/run/strategy/MultipleSparkJobExecutionStrategy.java#L111-L128),
 which executes clustering groups in parallel using `FutureUtils.allOf()`.
   
   **The bug:** In the [`whenComplete` 
callback](https://github.com/apache/hudi/blob/master/hudi-common/src/main/java/org/apache/hudi/common/util/FutureUtils.java#L50-L55),
 when a future fails, 
[`cancel(true)`](https://github.com/apache/hudi/blob/master/hudi-common/src/main/java/org/apache/hudi/common/util/FutureUtils.java#L52)
 is called on all other futures *before* 
[`union.completeExceptionally(throwable)`](https://github.com/apache/hudi/blob/master/hudi-common/src/main/java/org/apache/hudi/common/util/FutureUtils.java#L53).
 Since 
[`CompletableFuture.cancel()`](https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/util/concurrent/CompletableFuture.html#cancel(boolean))
 is equivalent to `completeExceptionally(new CancellationException())`, it 
synchronously completes the cancelled futures and triggers their `whenComplete` 
callbacks. Those callbacks in turn call 
`union.completeExceptionally(CancellationException)`. Because 
[`completeExceptionally`](https://docs.oracle.com/
 
en/java/javase/11/docs/api/java.base/java/util/concurrent/CompletableFuture.html#completeExceptionally(java.lang.Throwable))
 can only succeed once, the `CancellationException` wins and the original 
exception is permanently lost.
   
   Additionally, the internal 
[`BiRelay`](https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/util/concurrent/CompletableFuture.html#allOf(java.util.concurrent.CompletableFuture...))
 tree inside 
[`CompletableFuture.allOf()`](https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/util/concurrent/CompletableFuture.html#allOf(java.util.concurrent.CompletableFuture...))
 also detects the cancelled futures and may independently complete `union` with 
`CancellationException`, creating a three-way race.
   
   **Example:** Suppose clustering has 3 groups running in parallel. Group 1 
fails with `IOException` (e.g., transient DFS error). The `whenComplete` 
callback for Group 1 fires and calls `cancel()` on Groups 2 and 3. Group 2's 
cancellation triggers its own `whenComplete` callback, which calls 
`union.completeExceptionally(CancellationException)` — this succeeds. When 
Group 1's callback then calls `union.completeExceptionally(IOException)`, it's 
a no-op. The caller at `.join()` sees only `CompletionException: 
CancellationException` with no indication of the real failure.
   
   ### Summary and Changelog
   
   - Atomically capture the first failure using an `AtomicReference<Throwable>` 
via `compareAndSet`, which is immune to callback ordering
   - Replace `union.thenApply()` with `union.handle()` to intercept both 
success and failure paths, substituting the real first failure from the 
`AtomicReference` over whatever exception the `allOf` BiRelay propagated
   - Remove the now-unnecessary `union.completeExceptionally()` call
   - Added `TestFutureUtils` with tests for: all-succeed, 
single-failure-preserves-exception-and-cancels-others, and concurrent failure 
scenario
   
   ### Impact
   
   Any code path using `FutureUtils.allOf()` will now correctly propagate the 
original root-cause exception instead of a misleading `CancellationException`. 
The primary user-facing impact is in clustering execution 
(`MultipleSparkJobExecutionStrategy`), where failures will now produce 
actionable error messages instead of opaque `CancellationException`s.
   
   No public API changes.
   
   ### Risk Level
   
   low — The change is limited to `FutureUtils.allOf()` and preserves the same 
behavioral contract (fail-fast with cancellation of remaining futures). The 
only difference is which exception is propagated. Unit tests are added to 
verify correctness.
   
   ### Documentation Update
   
   none
   
   ### Contributor's checklist
   
   - [x] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [x] Enough context is provided in the sections above
   - [x] Adequate tests were added if applicable
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to