[
https://issues.apache.org/jira/browse/FLINK-39998?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Gyula Fora closed FLINK-39998.
------------------------------
Fix Version/s: 2.4.0
Resolution: Fixed
merged to master 5d5184e0427a8c575e0bb745cac644ca54eb6963
> RocksDBStateDownloader loses parallel download failures, surfacing a
> misleading ClosedChannelException instead of the real root cause
> --------------------------------------------------------------------------------------------------------------------------------------
>
> Key: FLINK-39998
> URL: https://issues.apache.org/jira/browse/FLINK-39998
> Project: Flink
> Issue Type: Bug
> Components: Runtime / State Backends
> Affects Versions: 2.0.0, 1.20.5, 2.1.3
> Reporter: Raj
> Priority: Major
> Labels: pull-request-available, starter
> Fix For: 2.4.0
>
> Attachments: Screenshot 2026-06-26 at 3.23.42 PM.png
>
>
> When RocksDBStateDownloader.transferAllStateDataToDirectory fails during
> incremental state restore, the exception surfaced to the operator is always
> ClosedChannelException — regardless of the actual root cause. This makes
> diagnosing restore failures extremely difficult.
> *Example scenario:* A checkpoint file is accidentally deleted from S3. On job
> restart, Flink keeps crashing with:
> Caused by: java.io.IOException: java.nio.channels.ClosedChannelException
> at
> org.apache.flink.state.rocksdb.RocksDBStateDownloader.downloadDataForStateHandle
> There is no indication of which file is missing, which checkpoint is
> affected, or that S3 is involved at all.
>
> *Root cause:*
> FutureUtils.completeAll() collects all parallel thread failures as
> suppressed exceptions on the first exception to arrive (CompletionException).
> However, CompletableFuture.get()
> internally calls JDK's reportGet() which strips the CompletionException
> wrapper before throwing ExecutionException:
> // JDK CompletableFuture.reportGet()
> if (x instanceof CompletionException && x.getCause() != null)
> x = cause; // strips CompletionException — suppressed list GONE
> throw new ExecutionException(x);
> By the time the catch block in transferAllStateDataToDirectory runs, all
> thread failures except one are permanently lost. Which failure "wins" is
> non-deterministic — it is whichever
> thread completes first, which is typically the cascade ClosedChannelException
> (a local operation, fast) rather than the real cause (e.g.
> FileNotFoundException from a remote storage call).
>
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)