sameer2800 opened a new pull request, #28552:
URL: https://github.com/apache/flink/pull/28552
### ⏺ What is the purpose of the change
When parallel RocksDB state downloads fail, only the first exception to
arrive is surfaced — all other thread failures are silently lost. This is
non-deterministic: whichever thread happens to complete first "wins" the
primary exception slot, which is often a cascade ClosedChannelException rather
than the actual failure.
Consider a real scenario: a checkpoint file gets accidentally deleted from
S3. On job restart, Flink keeps crashing. The only error visible in the logs is
ClosedChannelException, which points to a local file write issue and gives no
hint that a remote file is missing. The on-call engineer has no indication of
which file is gone, which checkpoint is affected, or that S3 is involved at all
— making diagnosis extremely painful.
**Before**:
Caused by: java.io.IOException: java.nio.channels.ClosedChannelException
at
org.apache.flink.state.rocksdb.RocksDBStateDownloader.downloadDataForStateHandle
**After**:
Caused by: java.io.IOException: 2 downloads failed with distinct errors:
[IOException: ClosedChannelException | IOException:
FileNotFoundException:
No such file:
s3://bucket/checkpoints/.../shared/16040859-c981-4407-bf8d-8fd6bbb66f6f]
The root cause: FutureUtils.completeAll() collects all thread failures as
suppressed exceptions on a CompletionException. However,
CompletableFuture.get() internally strips that wrapper
before throwing ExecutionException — so by the time the catch block runs,
all parallel failures except one are permanently gone.
### Brief change log
- RocksDBStateDownloader — registers a whenComplete callback on the
download future to capture the raw CompletionException (with full suppressed
list) before CompletableFuture.get() strips it
- Introduces buildDownloadException which strips the wrapper chain of each
collected failure (CompletionException → RuntimeException → IOException),
deduplicates by type and message, and returns either the single root cause
directly or a merged exception listing all distinct failures with full stack
traces preserved as suppressed entries
### Verifying this change
This change added tests and can be verified as follows:
- testSingleDownloadFailureSurfacedDirectly — verifies that a single
download failure returns the root cause exception directly without a merged
message
- testRootCauseVisibleAmongCascadeFailures — verifies that the root cause
IOException is visible in the error when one handle fails among many parallel
ByteStreamStateHandle downloads
- testMultipleDistinctFailuresMergedInMessage — verifies that distinct
failures across N threads (coordinated via CyclicBarrier to prevent
registry-closure cascade) all appear in the merged error message
### Does this pull request potentially affect one of the following parts:
- Dependencies (does it add or upgrade a dependency): no
- The public API, i.e., is any changed class annotated with
@Public(Evolving): no
- The serializers: no
- The runtime per-record code paths (performance sensitive): no
- Anything that affects deployment or recovery: JobManager (and its
components), Checkpointing, Kubernetes/Yarn, ZooKeeper: no (improves error
reporting on restore failure only)
- The S3 file system connector: no
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]