sameer2800 opened a new pull request, #28552:
URL: https://github.com/apache/flink/pull/28552

   
   ### ⏺ What is the purpose of the change
   
   When parallel RocksDB state downloads fail, only the first exception to 
arrive is surfaced — all other thread failures are silently lost. This is 
non-deterministic: whichever thread happens to complete first "wins" the 
primary exception slot, which is often a cascade ClosedChannelException rather 
than the actual failure.
   Consider a real scenario: a checkpoint file gets accidentally deleted from 
S3. On job restart, Flink keeps crashing. The only error visible in the logs is 
ClosedChannelException, which points to a local file write issue and gives no 
hint that a remote file is missing. The on-call engineer has no indication of 
which file is gone, which checkpoint is affected, or that S3 is involved at all 
— making diagnosis extremely painful.
   
   
   **Before**:
     Caused by: java.io.IOException: java.nio.channels.ClosedChannelException
         at 
org.apache.flink.state.rocksdb.RocksDBStateDownloader.downloadDataForStateHandle
   
     **After**:
     Caused by: java.io.IOException: 2 downloads failed with distinct errors:
       [IOException: ClosedChannelException | IOException: 
FileNotFoundException:
        No such file: 
s3://bucket/checkpoints/.../shared/16040859-c981-4407-bf8d-8fd6bbb66f6f]
   
   
    The root cause: FutureUtils.completeAll() collects all thread failures as 
suppressed exceptions on a CompletionException. However, 
CompletableFuture.get() internally strips that wrapper
     before throwing ExecutionException — so by the time the catch block runs, 
all parallel failures except one are permanently gone.
   
   ### Brief change log                                                         
                                                                          
   
     - RocksDBStateDownloader — registers a whenComplete callback on the 
download future to capture the raw CompletionException (with full suppressed 
list) before CompletableFuture.get() strips it
     - Introduces buildDownloadException which strips the wrapper chain of each 
collected failure (CompletionException → RuntimeException → IOException), 
deduplicates by type and message, and returns either the single root cause 
directly or a merged exception listing all distinct failures with full stack 
traces preserved as suppressed entries
   
   ###  Verifying this change
   
     This change added tests and can be verified as follows:
   
     - testSingleDownloadFailureSurfacedDirectly — verifies that a single 
download failure returns the root cause exception directly without a merged 
message
     - testRootCauseVisibleAmongCascadeFailures — verifies that the root cause 
IOException is visible in the error when one handle fails among many parallel 
ByteStreamStateHandle downloads
     - testMultipleDistinctFailuresMergedInMessage — verifies that distinct 
failures across N threads (coordinated via CyclicBarrier to prevent 
registry-closure cascade) all appear in the merged error message
   
   ###   Does this pull request potentially affect one of the following parts:
   
     - Dependencies (does it add or upgrade a dependency): no
     - The public API, i.e., is any changed class annotated with 
@Public(Evolving): no
     - The serializers: no
     - The runtime per-record code paths (performance sensitive): no
     - Anything that affects deployment or recovery: JobManager (and its 
components), Checkpointing, Kubernetes/Yarn, ZooKeeper: no (improves error 
reporting on restore failure only)
     - The S3 file system connector: no
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to