Raj created FLINK-39998:
---------------------------

             Summary:  RocksDBStateDownloader loses parallel download failures, 
surfacing a misleading ClosedChannelException instead of the real root cause
                 Key: FLINK-39998
                 URL: https://issues.apache.org/jira/browse/FLINK-39998
             Project: Flink
          Issue Type: Bug
          Components: Runtime / State Backends
    Affects Versions: 2.1.3, 1.20.5, 2.0.0
            Reporter: Raj
         Attachments: Screenshot 2026-06-26 at 3.23.42 PM.png

When RocksDBStateDownloader.transferAllStateDataToDirectory fails during
incremental state restore, the exception surfaced to the operator is almost
always a ClosedChannelException, regardless of the actual root cause. This
makes restore failures very hard to diagnose.

*Example scenario:* A checkpoint file is accidentally deleted from S3. On job 
restart, Flink keeps crashing with:
  Caused by: java.io.IOException: java.nio.channels.ClosedChannelException
      at org.apache.flink.state.rocksdb.RocksDBStateDownloader.downloadDataForStateHandle

There is no indication of which file is missing, which checkpoint is affected, 
or that S3 is involved at all.
  
*Root cause:*

  FutureUtils.completeAll() collects all parallel thread failures as suppressed
exceptions on the first exception to arrive (a CompletionException). However,
CompletableFuture.get() internally calls the JDK's reportGet(), which strips
the CompletionException wrapper before throwing ExecutionException:

  // JDK CompletableFuture.reportGet(), simplified
  if (x instanceof CompletionException && x.getCause() != null)
      x = x.getCause();  // strips CompletionException; its suppressed list is gone
  throw new ExecutionException(x);

  By the time the catch block in transferAllStateDataToDirectory runs, all
thread failures except one are permanently lost. Which failure "wins" is
non-deterministic: it is whichever thread fails first, which is typically the
cascading ClosedChannelException (a fast, local operation) rather than the real
cause (e.g. a FileNotFoundException from a remote storage call).
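
  The effect can be reproduced outside Flink with plain JDK classes. The sketch
below is illustrative only: the class SuppressedLossDemo and its method names
are hypothetical, and the failed future is built by hand to mimic the shape
described above (a CompletionException carrying another failure as suppressed)
rather than by calling FutureUtils.completeAll().

```java
import java.io.FileNotFoundException;
import java.nio.channels.ClosedChannelException;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.concurrent.ExecutionException;

// Hypothetical demo class, not part of Flink.
final class SuppressedLossDemo {

    /** Builds a future that failed with a CompletionException carrying one suppressed failure. */
    static CompletableFuture<Void> failedFuture() {
        CompletionException first = new CompletionException(new ClosedChannelException());
        // Stand-in for the failure of a second download thread (placeholder path).
        first.addSuppressed(new FileNotFoundException("No such file: <placeholder>"));
        CompletableFuture<Void> future = new CompletableFuture<>();
        future.completeExceptionally(first);
        return future;
    }

    /** Counts the suppressed exceptions reachable from what get() throws. */
    static int suppressedVisibleViaGet() {
        try {
            failedFuture().get();
            return -1; // not reached: the future always fails
        } catch (ExecutionException e) {
            // get() unwrapped the CompletionException, so the cause is the bare
            // ClosedChannelException, which carries no suppressed exceptions.
            return e.getCause().getSuppressed().length;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    /** Counts the suppressed exceptions when the raw failure is read via handle(). */
    static int suppressedVisibleViaHandle() {
        // handle() receives the exception the future was completed with, unmodified.
        Throwable raw = failedFuture().handle((value, error) -> error).join();
        return raw.getSuppressed().length;
    }

    public static void main(String[] args) {
        System.out.println("suppressed visible via get():    " + suppressedVisibleViaGet());
        System.out.println("suppressed visible via handle(): " + suppressedVisibleViaHandle());
    }
}
```

  In this sketch get() exposes no suppressed exceptions, while handle() still
sees the one attached to the original CompletionException. Whether handle(),
whenComplete() or another approach is the right fit for
transferAllStateDataToDirectory is left open here.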
  
 *Expected behavior:*
  Caused by: java.io.IOException: 2 downloads failed with distinct errors:
    [IOException: ClosedChannelException | IOException: FileNotFoundException:
     No such file: s3://bucket/checkpoints/.../shared/16040859-c981-4407-bf8d-8fd6bbb66f6f]
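
  One way to produce a message of this shape is to fold the primary failure and
its suppressed failures into a single IOException. This is a minimal sketch,
assuming the suppressed list is still available at the catch site.
DownloadFailureSummary, describe and summarize are hypothetical names, not
existing Flink APIs, and the message wording simply mirrors the example above.

```java
import java.io.IOException;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical helper, not part of Flink.
final class DownloadFailureSummary {

    /** Renders a throwable as "SimpleName: message", or "SimpleName" when it has no message. */
    static String describe(Throwable t) {
        String name = t.getClass().getSimpleName();
        return t.getMessage() == null ? name : name + ": " + t.getMessage();
    }

    /** Folds a primary failure and its suppressed failures into one IOException. */
    static IOException summarize(Throwable primary) {
        // LinkedHashSet drops duplicate descriptions but keeps arrival order.
        Set<String> distinct = new LinkedHashSet<>();
        distinct.add(describe(primary));
        for (Throwable suppressed : primary.getSuppressed()) {
            distinct.add(describe(suppressed));
        }
        String message = distinct.size()
                + " downloads failed with distinct errors: ["
                + String.join(" | ", distinct)
                + "]";
        // Keep the primary failure as the cause so no stack trace is lost.
        return new IOException(message, primary);
    }
}
```

  Deduplicating by description keeps the message short when many download
threads fail with the same cascading error, and the original exception chain
stays attached as the cause.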
     



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
