sl3635 opened a new pull request, #3617:
URL: https://github.com/apache/celeborn/pull/3617

   ### What changes were proposed in this pull request?
   
   Fix a bug in `CelebornInputStream` where a `RuntimeException` thrown during 
best-effort stream cleanup prevents peer failover when a primary worker becomes 
unregistered.
   
   In `createReaderWithRetry`, when reader creation fails on the primary, the 
code tries to close the existing stream by calling 
`clientFactory.createClient()` before switching to the peer. This cleanup was 
wrapped in `catch (InterruptedException | IOException ex)`. When SASL 
authentication is configured, `SaslClientBootstrap` wraps `IOException` in 
`RuntimeException`, so the cleanup call also throws `RuntimeException`. This 
uncaught exception escapes the retry loop entirely, bypassing `location = 
location.getPeer()` and causing the executor to exhaust retries on the same 
failed primary worker.
   
   The fix changes the cleanup catch to `catch (Exception ex)` so that any 
exception during best-effort cleanup is logged and swallowed, allowing the peer 
switch to proceed.
   
   ### Why are the changes needed?
   
   Without this fix, when a worker pod is rotated or becomes unregistered and 
SASL authentication is enabled, the replica retry mechanism silently fails. The 
executor retries multiple times on the same dead primary worker and eventually 
fails the task, even though a healthy replica exists.
   
   ### Does this PR resolve a correctness bug?
   
   Yes.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No.
   
   ### How was this patch tested?
   
   Added `CelebornInputStreamPeerFailoverTest` with three unit tests:
   - `testPeerFailoverWithRuntimeExceptionDuringCleanup`: primary fails, 
cleanup throws `RuntimeException` (simulates SASL wrapping), replica succeeds — 
verifies the fix
   - `testPeerFailoverWithIOExceptionDuringCleanup`: same scenario with plain 
`IOException` during cleanup — verifies existing behavior is preserved
   - `testFailureWithoutPeer`: no replica configured, verifies retries are 
exhausted and `CelebornIOException` is thrown


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to