Amruth Ashok created SPARK-55661:
------------------------------------

             Summary: TaskRunner.run() setup failure silently leaks driver-side 
resources (cores/GPUs), causing permanent scheduling starvation
                 Key: SPARK-55661
                 URL: https://issues.apache.org/jira/browse/SPARK-55661
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 3.5.0
         Environment: This was observed on a Databricks cluster with the 
following conditions:
 * Single-GPU node (spark.task.resource.gpu.amount=1)

 * AQE enabled (default)

 * A left_anti join followed by toPandas() triggered AQE re-optimization

 * AQE cancelled and re-submitted stages in rapid succession (~25ms between 
dispatch and cancellation)

 * A task was dispatched via LaunchTask and then killed via KillTask so quickly 
that the executor's TaskRunner.run() was interrupted during setup

 * The executor logged "Got assigned task 48" but never logged
"Running task 48", "Finished task 48", or "Executor killed task 48"

 * The GPU resource was permanently leaked, preventing the replacement stage 
(Stage 21) from ever being scheduled

 * The DeadlockDetector fired DAG_SCHEDULER_NO_ACTIVE_TASK every 5 minutes for 
hours until manual cancellation
            Reporter: Amruth Ashok


If TaskRunner.run() throws an exception during its setup phase, before reaching 
the inner try/catch/finally block, no StatusUpdate is sent to the driver, and 
runningTasks is never cleaned up.
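
The failure shape described above can be illustrated with a toy model. This is a minimal sketch in plain Python, not Spark's actual code: run_task, interrupted_setup, and the two containers are hypothetical stand-ins, and it assumes only what the report states, namely that setup runs before the guarded block and that the status update and cleanup live inside that block.

```python
def run_task(task_id, setup, body, running_tasks, status_updates):
    """Hypothetical model of a run() whose setup sits outside the guarded block."""
    setup()  # an exception raised here skips everything below
    try:
        body()
        status_updates.append((task_id, "FINISHED"))
    except Exception:
        status_updates.append((task_id, "FAILED"))
    finally:
        # Cleanup only happens once the guarded block has been entered.
        running_tasks.pop(task_id, None)


def interrupted_setup():
    # Stand-in for a kill/interrupt that lands during setup.
    raise RuntimeError("interrupted during setup")


running_tasks = {48: "runner"}  # registered when the task was assigned
status_updates = []             # messages that would reach the driver

try:
    run_task(48, interrupted_setup, lambda: None, running_tasks, status_updates)
except RuntimeError:
    pass  # in this toy model the task thread simply dies here

# A task whose setup succeeds reports back and is cleaned up as expected.
running_tasks[49] = "runner"
run_task(49, lambda: None, lambda: None, running_tasks, status_updates)

print(status_updates)  # only task 49 reported a terminal state
print(running_tasks)   # task 48 is still registered
```

In this model task 48 never produces a status update and stays in running_tasks, while task 49 behaves normally. Moving the setup() call inside the try block would be one way to make the toy model report a failure; whether that matches the appropriate fix in Spark is not established by this report.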

The driver's CoarseGrainedSchedulerBackend acquires resources (CPU cores, GPU 
slots) in launchTasks() but only releases them when a 
StatusUpdate(FINISHED|FAILED|KILLED) arrives.

A missing StatusUpdate permanently leaks those resources in executorDataMap, 
making them unavailable for future task scheduling. On a resource-constrained 
cluster (e.g., single-GPU node with spark.task.resource.gpu.amount=1), this 
causes complete scheduling starvation: no further tasks can be launched and 
the job hangs indefinitely.
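
The driver-side accounting can be sketched the same way. The class below is a hypothetical simplification in plain Python (ToyDriverBackend, launch_task, and status_update are illustrative names, not Spark's API); it assumes only the behaviour described in this report: a slot is acquired at launch and released solely when a terminal status update arrives.

```python
class ToyDriverBackend:
    """Toy model of driver-side slot accounting; all names are illustrative."""

    TERMINAL = {"FINISHED", "FAILED", "KILLED"}

    def __init__(self, gpus_per_executor):
        self.free_gpus = dict(gpus_per_executor)  # executor_id -> free GPU slots
        self.task_executor = {}                   # task_id -> executor_id

    def launch_task(self, task_id, executor_id):
        """Acquire one GPU slot; return False if none is free."""
        if self.free_gpus[executor_id] < 1:
            return False
        self.free_gpus[executor_id] -= 1
        self.task_executor[task_id] = executor_id
        return True

    def status_update(self, task_id, state):
        """Release the slot only when a terminal state arrives."""
        if state in self.TERMINAL:
            executor_id = self.task_executor.pop(task_id)
            self.free_gpus[executor_id] += 1


# Single-GPU executor, as in the reported environment.
backend = ToyDriverBackend({"exec-0": 1})
launched_48 = backend.launch_task(48, "exec-0")  # takes the only GPU
# Task 48 dies during setup: no status_update(48, ...) is ever sent.
launched_49 = backend.launch_task(49, "exec-0")  # refused: no free slot

# For contrast, a terminal update frees the slot again.
recovered = ToyDriverBackend({"exec-0": 1})
recovered.launch_task(48, "exec-0")
recovered.status_update(48, "KILLED")
relaunched = recovered.launch_task(49, "exec-0")

print(launched_48, launched_49, relaunched)
```

With one GPU slot and no terminal update for task 48, every later launch attempt in the first backend is refused, which mirrors the starvation described above. The second backend shows that a single KILLED update is enough to make the slot schedulable again.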



