Raj created FLINK-39998:
---------------------------
Summary: RocksDBStateDownloader loses parallel download failures,
surfacing a misleading ClosedChannelException instead of the real root cause
Key: FLINK-39998
URL: https://issues.apache.org/jira/browse/FLINK-39998
Project: Flink
Issue Type: Bug
Components: Runtime / State Backends
Affects Versions: 2.1.3, 1.20.5, 2.0.0
Reporter: Raj
Attachments: Screenshot 2026-06-26 at 3.23.42 PM.png
When RocksDBStateDownloader.transferAllStateDataToDirectory fails during
incremental state restore, the exception surfaced to the operator is always
ClosedChannelException — regardless of the actual root cause. This makes
diagnosing restore failures extremely difficult.
*Example scenario:* A checkpoint file is accidentally deleted from S3. On job
restart, Flink keeps crashing with:
Caused by: java.io.IOException: java.nio.channels.ClosedChannelException
at
org.apache.flink.state.rocksdb.RocksDBStateDownloader.downloadDataForStateHandle
There is no indication of which file is missing, which checkpoint is affected,
or that S3 is involved at all.
*Root cause:*
FutureUtils.completeAll() collects all parallel thread failures as suppressed
exceptions on the first exception to arrive (CompletionException). However,
CompletableFuture.get()
internally calls JDK's reportGet() which strips the CompletionException
wrapper before throwing ExecutionException:
// JDK CompletableFuture.reportGet()
if (x instanceof CompletionException && x.getCause() != null)
x = cause; // strips CompletionException — suppressed list GONE
throw new ExecutionException(x);
By the time the catch block in transferAllStateDataToDirectory runs, all
thread failures except one are permanently lost. Which failure "wins" is
non-deterministic — it is whichever
thread completes first, which is typically the cascade ClosedChannelException
(a local operation, fast) rather than the real cause (e.g.
FileNotFoundException from a remote storage call).
*Expected behavior:*
Caused by: java.io.IOException: 2 downloads failed with distinct errors:
[IOException: ClosedChannelException | IOException: FileNotFoundException:
No such file:
s3://bucket/checkpoints/.../shared/16040859-c981-4407-bf8d-8fd6bbb66f6f]
--
This message was sent by Atlassian Jira
(v8.20.10#820010)