[ https://issues.apache.org/jira/browse/SPARK-57191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Kousuke Saruta resolved SPARK-57191.
------------------------------------
Fix Version/s: 4.2.0
Assignee: Shrirang Mhalgi
Resolution: Fixed
Issue resolved by https://github.com/apache/spark/pull/56274
> [YARN] Driver hangs indefinitely when job submission / monitor thread fails
> ---------------------------------------------------------------------------
>
> Key: SPARK-57191
> URL: https://issues.apache.org/jira/browse/SPARK-57191
> Project: Spark
> Issue Type: Bug
> Components: YARN
> Affects Versions: 4.1.2
> Reporter: Rohan Arora
> Assignee: Shrirang Mhalgi
> Priority: Major
> Labels: pull-request-available
> Fix For: 4.2.0
>
>
> h4. *Overview*
> In a Spark-on-YARN client-mode deployment, if a fatal uncaught exception is
> thrown within the asynchronous application-submission or
> application-monitoring thread (e.g., during initialisation inside
> {{YarnClientSchedulerBackend}} or YARN {{Client.scala}}), the Spark
> Driver process hangs indefinitely instead of shutting down or propagating
> the exception to the main thread.
> h4. *Root Cause Analysis*
> # *Asynchronous Execution*: When YARN client mode starts,
> {{YarnClientSchedulerBackend}} submits the Spark application context to YARN
> and monitors it asynchronously (e.g., utilising the internal
> {{MonitorThread}} or Scala standard {{Future}} contexts).
> # *Exception Swallowing/Isolation*: If a fatal exception occurs in these
> background threads (such as a network failure, credential expiration, or an
> {{OutOfMemoryError}} during the initial handshake), the exception is either
> swallowed by Scala {{Future}} execution pools or trapped in a thread that is
> not guarded by Spark’s custom {{SparkUncaughtExceptionHandler}}.
> # *Blocked Threads Never Released*: Main threads (like the one executing
> {{SparkContext.init}} or {{waitForApplication}}) remain indefinitely
> blocked waiting on the future's completion or a lock notification.
> # *Zombie JVM State*: Since the driver process has already spun up
> active non-daemon threads (such as heartbeats, the Spark UI HTTP server, and
> log appenders), the JVM does not exit naturally, leaving the driver in a
> zombie/hung state.
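The failure mode in the four steps above can be reproduced in a few lines of plain Java. This is only a sketch: `SilentFailureDemo`, `awaitSubmission`, and the thread name are hypothetical and do not correspond to Spark classes. The worker is marked as a daemon and the wait is bounded purely so that the demo terminates; the issue describes an unbounded wait combined with non-daemon threads.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Illustrative only: every name in this class is hypothetical, not a Spark API.
final class SilentFailureDemo {

    /**
     * Starts a background "submission" thread and waits for it to signal
     * completion. Returns true if the signal arrived, false on timeout.
     */
    static boolean awaitSubmission(boolean failInBackground, long timeoutMillis) {
        CountDownLatch submitted = new CountDownLatch(1);

        Thread worker = new Thread(() -> {
            if (failInBackground) {
                // The thread dies here, so countDown() below is never reached.
                throw new IllegalStateException("simulated submission failure");
            }
            submitted.countDown();
        }, "hypothetical-submit-thread");

        // The failure is delivered to this handler only; the waiter never sees it.
        worker.setUncaughtExceptionHandler((t, e) ->
            System.err.println(t.getName() + " died: " + e.getMessage()));
        worker.setDaemon(true); // keeps the demo itself from lingering
        worker.start();

        try {
            // With an unbounded await() this call would never return after a
            // background failure; the timeout only keeps the demo finite.
            return submitted.await(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException ie) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("healthy run signalled: " + awaitSubmission(false, 2000));
        System.out.println("failed run signalled:  " + awaitSubmission(true, 200));
    }
}
```

The waiting thread receives no exception in the failing case; it only observes that the signal never arrives.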
> h4. *Impact on Managed Environments*
> In orchestration and managed environments (such as cloud platform agents,
> workflows, and schedulers), the agent continues to report the job driver
> process as active. The scheduler cannot distinguish this hung driver from a
> driver performing legitimate post-execution cleanup (like metastore
> synchronization or final file renaming). This leads to resource leaks,
> orphaned driver processes, and long job timeouts for customers.
> h4. *Proposed Solution*
> * *Exception Propagation*: Ensure that worker thread closures and
> background futures executing YARN submissions are wrapped in
> {{try-catch}} blocks that propagate exceptions to Spark's uncaught exception
> handler ({{ThreadUtils.runInNewThread}} should be used for thread
> creation).
> * *Explicit Teardown on Failure*: On critical failures inside the
> submission or monitoring loops, explicitly trigger {{SparkContext.stop()}} or
> standard JVM termination ({{System.exit(exitCode)}}) so that the main
> thread does not block indefinitely on states that will never resolve.
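The exception-propagation idea in the proposal can be sketched in plain Java as follows. This is not Spark's `ThreadUtils.runInNewThread` and not the change made in the pull request; it is a simplified analogue under assumed names (`PropagatingRunner`, `runAndPropagate`) showing one way a caller can receive a worker thread's failure instead of waiting forever.

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

// Illustrative only: a hypothetical helper, not the Spark implementation.
final class PropagatingRunner {

    /**
     * Runs body on a new thread, waits for it to finish, and rethrows any
     * failure from the worker on the calling thread.
     */
    static <T> T runAndPropagate(String threadName, Supplier<T> body) {
        AtomicReference<T> result = new AtomicReference<>();
        AtomicReference<Throwable> failure = new AtomicReference<>();

        Thread worker = new Thread(() -> {
            try {
                result.set(body.get());
            } catch (Throwable t) {
                // Record the failure instead of letting it die with the thread.
                failure.set(t);
            }
        }, threadName);
        worker.setDaemon(true);
        worker.start();

        try {
            worker.join();
        } catch (InterruptedException ie) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for " + threadName, ie);
        }

        // Surface the worker's failure to the caller.
        Throwable t = failure.get();
        if (t instanceof RuntimeException) throw (RuntimeException) t;
        if (t instanceof Error) throw (Error) t;
        if (t != null) throw new IllegalStateException(t);
        return result.get();
    }

    public static void main(String[] args) {
        System.out.println("result: " + runAndPropagate("hypothetical-ok", () -> 42));
        try {
            runAndPropagate("hypothetical-submit", () -> {
                throw new IllegalStateException("simulated submission failure");
            });
        } catch (IllegalStateException e) {
            // The caller now sees the failure and could start an explicit teardown.
            System.out.println("caller saw: " + e.getMessage());
        }
    }
}
```

Once the caller sees the failure, it is in a position to perform the explicit teardown described in the second bullet rather than blocking on a state that will never resolve.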
--
This message was sent by Atlassian Jira
(v8.20.10#820010)