sl3635 opened a new pull request, #3617: URL: https://github.com/apache/celeborn/pull/3617
### What changes were proposed in this pull request? Fix a bug in `CelebornInputStream` where a `RuntimeException` thrown during best-effort stream cleanup prevents peer failover when a primary worker becomes unregistered. In `createReaderWithRetry`, when reader creation fails on the primary, the code tries to close the existing stream by calling `clientFactory.createClient()` before switching to the peer. This cleanup was wrapped in `catch (InterruptedException | IOException ex)`. When SASL authentication is configured, `SaslClientBootstrap` wraps `IOException` in `RuntimeException`, so the cleanup call also throws `RuntimeException`. This uncaught exception escapes the retry loop entirely, bypassing `location = location.getPeer()` and causing the executor to exhaust retries on the same failed primary worker. The fix changes the cleanup catch to `catch (Exception ex)` so that any exception during best-effort cleanup is logged and swallowed, allowing the peer switch to proceed. ### Why are the changes needed? Without this fix, when a worker pod is rotated or becomes unregistered and SASL authentication is enabled, the replica retry mechanism silently fails. The executor retries multiple times on the same dead primary worker and eventually fails the task, even though a healthy replica exists. ### Does this PR resolve a correctness bug? Yes. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Added `CelebornInputStreamPeerFailoverTest` with three unit tests: - `testPeerFailoverWithRuntimeExceptionDuringCleanup`: primary fails, cleanup throws `RuntimeException` (simulates SASL wrapping), replica succeeds — verifies the fix - `testPeerFailoverWithIOExceptionDuringCleanup`: same scenario with plain `IOException` during cleanup — verifies existing behavior is preserved - `testFailureWithoutPeer`: no replica configured, verifies retries are exhausted and `CelebornIOException` is thrown -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
