shrirangmhalgi opened a new pull request, #56274:
URL: https://github.com/apache/spark/pull/56274

   
   ### What changes were proposed in this pull request?
   In YARN client mode, `YarnClientSchedulerBackend`'s `MonitorThread` only 
catches `InterruptedException` / `InterruptedIOException`. If any other 
exception occurs during application monitoring (e.g., network failure, 
credential expiration, or other runtime errors), the thread dies silently. 
Because the driver JVM still has active non-daemon threads (SparkUI, 
heartbeats), the process then hangs indefinitely in a zombie state.
   
   This patch adds a `NonFatal` catch clause that logs the error and calls 
`sc.stop()`, ensuring the driver shuts down cleanly.
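
   The catch ordering described above can be sketched in isolation as follows. 
This is a minimal, Spark-free illustration, not the code from the patch: 
`MonitorThreadSketch`, `MonitorThreadDemo`, and the `monitor` / `stop` 
parameters are hypothetical stand-ins for the real monitor thread, 
`Client.monitorApplication`, and `sc.stop()`. Only `NonFatal` and the two 
interrupt exception types come from the description.

```scala
import java.io.InterruptedIOException
import java.util.concurrent.atomic.AtomicBoolean
import scala.util.control.NonFatal

// Hypothetical stand-in for the monitor thread; not Spark's actual class.
// `monitor` plays the role of the blocking monitoring call and `stop`
// plays the role of sc.stop().
final class MonitorThreadSketch(monitor: () => Unit, stop: () => Unit) extends Thread {
  setDaemon(true)

  override def run(): Unit = {
    try {
      monitor()
    } catch {
      // Interruption is the expected shutdown path, so exit quietly.
      // This case must come first: InterruptedIOException is an IOException
      // and would otherwise be matched by NonFatal below.
      case _: InterruptedException | _: InterruptedIOException =>
        ()
      // Any other non-fatal failure: log it and trigger shutdown instead of
      // letting the thread die silently.
      case NonFatal(e) =>
        System.err.println(s"Monitor thread failed: ${e.getMessage}")
        stop()
    }
  }
}

object MonitorThreadDemo {
  def main(args: Array[String]): Unit = {
    val stopped = new AtomicBoolean(false)
    val t = new MonitorThreadSketch(
      monitor = () => throw new RuntimeException("simulated network failure"),
      stop = () => stopped.set(true))
    t.start()
    t.join()
    println(s"stop called: ${stopped.get}") // prints "stop called: true"
  }
}
```

   `NonFatal` deliberately does not match `VirtualMachineError`, 
`LinkageError`, or `InterruptedException`, so truly fatal JVM errors still 
propagate rather than being swallowed by the handler.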
   
   ### Why are the changes needed?
   In managed environments (cloud platform agents, workflow schedulers), a hung 
driver is indistinguishable from one doing legitimate post-execution work. This 
causes resource leakage, orphaned processes, and extended job timeout durations.
   
   ### Does this PR introduce _any_ user-facing change?
   Yes. Previously, a non-fatal exception other than an interrupt in the 
monitor thread caused the driver to hang forever. Now the driver logs the 
error and shuts down cleanly.
   
   
   ### How was this patch tested?
   Added a test to `YarnClientSchedulerBackendSuite` that mocks 
`Client.monitorApplication` to throw a `RuntimeException` and asserts that 
`sc.stop()` is called (verified via `SparkListener.onApplicationEnd`).
   
   ### Was this patch authored or co-authored using generative AI tooling?
   Yes.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

