shrirangmhalgi opened a new pull request, #56274: URL: https://github.com/apache/spark/pull/56274
### What changes were proposed in this pull request?

In YARN client mode, the `MonitorThread` in `YarnClientSchedulerBackend` catches only `InterruptedException` and `InterruptedIOException`. If any other exception occurs during application monitoring (e.g., a network failure, credential expiration, or another runtime error), the thread dies silently. Because the driver JVM still has active non-daemon threads (SparkUI, heartbeats), the process hangs indefinitely in a zombie state.

This patch adds a `NonFatal` catch clause that logs the error and calls `sc.stop()`, so the driver shuts down cleanly.

### Why are the changes needed?

In managed environments (cloud platform agents, workflow schedulers), a hung driver is indistinguishable from one doing legitimate post-execution work. This causes resource leakage, orphaned processes, and prolonged job timeouts.

### Does this PR introduce _any_ user-facing change?

Yes. Previously, certain failures in the monitor thread caused the driver to hang forever. Now the driver logs the error and shuts down cleanly.

### How was this patch tested?

Added a new test in `YarnClientSchedulerBackendSuite` that mocks `Client.monitorApplication` to throw a `RuntimeException` and asserts that `sc.stop()` is called (observed via `SparkListener.onApplicationEnd`).

### Was this patch authored or co-authored using generative AI tooling?

Yes.
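
For readers who want to see the shape of the catch-clause change described in this pull request, here is a minimal, self-contained sketch. It is not the patch itself and does not use Spark classes: `MonitorSketch`, `runMonitor`, and the `monitor`, `stop`, and `log` parameters are hypothetical stand-ins (for the monitoring call, `sc.stop()`, and the logger, respectively), and the returned label exists only to make the branch taken observable.

```scala
import java.io.InterruptedIOException
import scala.util.control.NonFatal

// Hypothetical stand-in for the monitor thread body; not the actual Spark code.
object MonitorSketch {
  /** Runs `monitor` and returns a label naming the branch that was taken. */
  def runMonitor(monitor: () => Unit, stop: () => Unit, log: String => Unit): String =
    try {
      monitor()
      // Handling of a normally finished application is omitted from this sketch.
      "finished"
    } catch {
      // Interruption is the expected way to end monitoring: log and exit quietly.
      // This case must come first, because InterruptedIOException is an
      // IOException and would otherwise be matched by NonFatal below.
      case _: InterruptedException | _: InterruptedIOException =>
        log("Interrupting monitor thread")
        "interrupted"
      // Any other non-fatal error: log it and stop, instead of letting the
      // thread die silently while the process keeps running.
      case NonFatal(e) =>
        log(s"Monitor thread failed: ${e.getMessage}")
        stop()
        "failed"
    }
}
```

`NonFatal` deliberately does not match `InterruptedException`, `VirtualMachineError`, `LinkageError`, or `ControlThrowable`, so fatal JVM errors still propagate rather than being swallowed by the new clause.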
