Rohan Arora created SPARK-57191:
-----------------------------------
Summary: [YARN] Driver hangs indefinitely when job submission /
monitor thread fails
Key: SPARK-57191
URL: https://issues.apache.org/jira/browse/SPARK-57191
Project: Spark
Issue Type: Bug
Components: YARN
Affects Versions: 4.1.2
Reporter: Rohan Arora
h4. *Overview*
In a Spark-on-YARN client-mode deployment, if a fatal uncaught exception is
thrown within the asynchronous application-submission or application-monitoring
thread (e.g., during initialisation inside {{YarnClientSchedulerBackend}} or
YARN {{Client.scala}}), the Spark driver process hangs indefinitely instead
of shutting down or propagating the exception to the main thread.
h4. *Root Cause Analysis*
# *Asynchronous Execution*: When YARN client mode starts,
{{YarnClientSchedulerBackend}} submits the Spark application to YARN
and monitors it asynchronously (e.g., using the internal {{MonitorThread}}
or standard Scala {{Future}} execution contexts).
# *Exception Swallowing/Isolation*: If a fatal failure occurs in these
background threads (such as a network failure, credential expiration, or an
{{OutOfMemoryError}} during the initial handshake), the exception is either
swallowed by the Scala {{Future}} execution pool or confined to a thread
that is not guarded by Spark's custom {{SparkUncaughtExceptionHandler}}.
# *Blocked Threads Never Resume*: Main threads (like the one executing
{{SparkContext.init}} or {{waitForApplication}}) remain blocked
indefinitely, waiting for a future to complete or for a lock notification
that never arrives.
# *Zombie JVM State*: Since the driver process has already started
non-daemon threads (such as heartbeats, the Spark UI HTTP server, and log
appenders), the JVM does not exit on its own, leaving the driver in a
zombie/hung state.
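
The swallowing and blocking steps above can be reproduced outside Spark. The following is a minimal Java sketch, not Spark code: the class name, the thread name, and the latch standing in for "application is running" are all hypothetical. It shows that a task submitted to an executor stores its exception in the returned {{Future}} rather than handing it to the thread's uncaught-exception handler, so a waiter that depends on a separate signal is never released. The wait is bounded here only so that the demo terminates.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

final class SwallowedFailureDemo {
    // Counts how many throwables reach the thread's uncaught-exception handler.
    static final AtomicInteger handlerCalls = new AtomicInteger();

    /** Returns true if the waiter was released before the timeout elapsed. */
    static boolean submitAndWait(long timeoutMs) {
        // Hypothetical stand-in for the state the main thread waits on.
        CountDownLatch submitted = new CountDownLatch(1);
        ExecutorService pool = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "hypothetical-submit-thread");
            t.setUncaughtExceptionHandler((th, e) -> handlerCalls.incrementAndGet());
            return t;
        });
        Runnable submission = () -> {
            // Fails before it can signal the latch.
            throw new IllegalStateException("simulated submission failure");
        };
        try {
            // The exception is captured inside the returned Future, which nobody inspects.
            pool.submit(submission);
            // The hang described in this issue corresponds to waiting here with no timeout.
            return submitted.await(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException ie) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        boolean released = submitAndWait(200);
        // prints: released=false, handlerCalls=0
        System.out.println("released=" + released + ", handlerCalls=" + handlerCalls.get());
    }
}
```

Neither the waiter nor the handler observes the failure, which matches the symptom reported here: the driver stays alive with nothing logged at the point of the hang.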
h4. *Impact on Managed Environments*
In orchestration and managed environments (such as cloud platform agents,
workflow engines, and schedulers), the agent continues to report the job driver
process as active. The scheduler cannot distinguish this hung driver from a
driver performing legitimate post-execution cleanup (like metastore
synchronization or final file renaming). This leads to resource leaks, orphaned
driver processes, and jobs that fail only after a long timeout expires.
h4. *Proposed Solution*
* *Exception Propagation*: Ensure that worker thread closures and
background futures executing YARN submissions are wrapped in
{{try-catch}} blocks that propagate exceptions to Spark's uncaught exception
handler ({{ThreadUtils.runInNewThread}} should be used to create these
threads).
* *Explicit Teardown on Failure*: On critical failures inside the
submission or monitoring loops, explicitly trigger {{SparkContext.stop()}} or
standard JVM termination ({{System.exit(exitCode)}}) so that the main
thread does not block indefinitely on state that will never resolve.
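
The two proposals above can be sketched together in plain Java. This is an illustration only, under stated assumptions: {{runInNewThreadAndPropagate}} and {{submitOrTearDown}} are hypothetical names, the helper follows the general idea of running work on a separate thread and rethrowing its failure on the caller (it is not Spark's {{ThreadUtils.runInNewThread}} implementation), and the {{stopContext}} callback stands in for {{SparkContext.stop()}} or {{System.exit(exitCode)}}.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.atomic.AtomicReference;

final class PropagatingRunner {
    /** Runs body on a new daemon thread, waits for it, and rethrows any failure on the caller. */
    static <T> T runInNewThreadAndPropagate(String name, Callable<T> body) {
        AtomicReference<T> result = new AtomicReference<>();
        AtomicReference<Throwable> failure = new AtomicReference<>();
        Thread t = new Thread(() -> {
            try {
                result.set(body.call());
            } catch (Throwable e) {
                // Record the failure instead of letting it die with the thread.
                failure.set(e);
            }
        }, name);
        t.setDaemon(true); // a failed helper thread must not keep the JVM alive
        t.start();
        try {
            t.join();
        } catch (InterruptedException ie) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for " + name, ie);
        }
        Throwable e = failure.get();
        if (e != null) {
            throw new IllegalStateException("background thread '" + name + "' failed", e);
        }
        return result.get();
    }

    /** Returns true on success; on failure runs the teardown callback and returns false. */
    static boolean submitOrTearDown(Callable<String> submit, Runnable stopContext) {
        try {
            runInNewThreadAndPropagate("hypothetical-submit-thread", submit);
            return true;
        } catch (IllegalStateException e) {
            // Stand-in for SparkContext.stop() or System.exit(exitCode).
            stopContext.run();
            return false;
        }
    }

    public static void main(String[] args) {
        boolean ok = submitOrTearDown(
            () -> { throw new java.io.IOException("simulated YARN submission failure"); },
            () -> System.out.println("teardown invoked"));
        System.out.println("submitted=" + ok);
    }
}
```

Running {{main}} prints {{teardown invoked}} followed by {{submitted=false}}. The design point is that the failure becomes visible on the thread that would otherwise block, so the teardown decision is made in one place instead of relying on each background thread to have a handler installed.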