Amruth Ashok created SPARK-55661:
------------------------------------
Summary: TaskRunner.run() setup failure silently leaks driver-side
resources (cores/GPUs), causing permanent scheduling starvation
Key: SPARK-55661
URL: https://issues.apache.org/jira/browse/SPARK-55661
Project: Spark
Issue Type: Bug
Components: Spark Core
Affects Versions: 3.5.0
Environment: This was observed on a Databricks cluster with the
following conditions:
* Single-GPU node (spark.task.resource.gpu.amount=1)
* AQE enabled (default)
* A left_anti join followed by toPandas() triggered AQE re-optimization
* AQE cancelled and re-submitted stages in rapid succession (~25ms between
dispatch and cancellation)
* A task was dispatched via LaunchTask and then killed via KillTask so quickly
that the executor's TaskRunner.run() was interrupted during setup
* The executor logged "Got assigned task 48" but never logged "Running task 48",
"Finished task 48", or "Executor killed task 48"
* The GPU resource was permanently leaked, preventing the replacement stage
(Stage 21) from ever being scheduled
* The DeadlockDetector fired DAG_SCHEDULER_NO_ACTIVE_TASK every 5 minutes for
hours until manual cancellation
Reporter: Amruth Ashok
If TaskRunner.run() throws an exception during its setup phase, before reaching
the inner try/catch/finally block, no StatusUpdate is sent to the driver, and
runningTasks is never cleaned up.
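The failure mode can be sketched with a small illustrative model (this is not
Spark source; class and field names here are invented for the illustration):
the setup phase runs before the try/finally that guarantees a terminal status
report, so an exception thrown there skips both the StatusUpdate and the
runningTasks cleanup.

```python
# Illustrative model of the failure mode (names are hypothetical,
# not actual Spark internals): the setup phase executes BEFORE the
# try/finally that reports a terminal status back to the driver.

class Executor:
    def __init__(self):
        self.running_tasks = {}
        self.status_updates = []          # messages "sent" to the driver

    def run_task(self, task_id, fail_during_setup=False):
        self.running_tasks[task_id] = "setup"
        # --- setup phase: NOT covered by the try/finally below ---
        if fail_during_setup:
            raise InterruptedError("killed during setup")
        # --- main phase: a terminal status is guaranteed from here on ---
        try:
            self.running_tasks[task_id] = "running"
        finally:
            self.status_updates.append((task_id, "FINISHED"))
            del self.running_tasks[task_id]

ex = Executor()
try:
    ex.run_task(48, fail_during_setup=True)
except InterruptedError:
    pass

# No StatusUpdate was ever sent and running_tasks was never cleaned up:
print(ex.status_updates)   # []
print(ex.running_tasks)    # {48: 'setup'}
```

Wrapping the setup phase in the same (or an outer) try/finally would guarantee
a terminal StatusUpdate even when setup is interrupted.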
The driver's CoarseGrainedSchedulerBackend acquires resources (CPU cores, GPU
slots) in launchTasks() but only releases them when a
StatusUpdate(FINISHED|FAILED|KILLED) arrives.
A missing StatusUpdate permanently leaks those resources in executorDataMap,
making them unavailable for future task scheduling. On a resource-constrained
cluster (e.g., single-GPU node with spark.task.resource.gpu.amount=1), this
causes complete scheduling starvation: no further tasks can ever be launched,
and the job hangs indefinitely.
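The driver-side accounting can be sketched as follows (again a hedged model,
not Spark source; free_gpus and the method names are illustrative stand-ins
for the bookkeeping in executorDataMap): resources are debited at launch and
credited back only when a terminal StatusUpdate arrives, so a task that dies
without one leaks its slot forever.

```python
# Hedged sketch of the driver-side resource accounting (hypothetical
# names, not Spark's actual fields): debit at launch, credit only on
# a terminal StatusUpdate.

class SchedulerBackend:
    def __init__(self, gpus=1):
        self.free_gpus = gpus

    def launch_task(self, task_id):
        if self.free_gpus < 1:
            return False                  # nothing can be scheduled
        self.free_gpus -= 1               # debit the GPU slot at launch
        return True

    def status_update(self, task_id, state):
        if state in ("FINISHED", "FAILED", "KILLED"):
            self.free_gpus += 1           # credit only on a terminal state

backend = SchedulerBackend(gpus=1)
backend.launch_task(48)                   # GPU acquired

# Task 48 dies during executor setup, so no StatusUpdate ever arrives;
# the GPU is never returned and every subsequent launch fails:
print(backend.launch_task(49))            # False -- permanent starvation
```

With only one GPU slot in the cluster, a single missing StatusUpdate is enough
to starve all future stages, matching the Stage 21 hang described above.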
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]