[jira] [Created] (SPARK-56235) TaskSetManager.executorLost() O(N) scan over taskInfos causes DriverEndpoint thread stall with large task counts

Dejiu Lu (Jira) Wed, 25 Mar 2026 22:30:19 -0700

Dejiu Lu created SPARK-56235:
--------------------------------

             Summary: TaskSetManager.executorLost() O(N) scan over taskInfos 
causes DriverEndpoint thread stall with large task counts
                 Key: SPARK-56235
                 URL: https://issues.apache.org/jira/browse/SPARK-56235
             Project: Spark
          Issue Type: Improvement
          Components: Spark Core
    Affects Versions: 4.1.1
            Reporter: Dejiu Lu



When a stage has a large number of tasks (e.g., 5 million), 
`TaskSetManager.executorLost()` becomes a severe bottleneck because it performs 
an O(N) full scan over the `taskInfos` HashMap. Combined with batch executor 
removal from dynamic allocation, this blocks the single-threaded 
`DriverEndpoint` RPC processor for 100+ seconds, which delays task status 
updates and makes the job appear stuck.
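
To illustrate the cost pattern described above, here is a minimal toy model in Python. It is not Spark's code and not a proposed patch: the class names `FullScanTracker` and `IndexedTracker`, the `tasks_by_executor` index, and the method names are all hypothetical, and the report does not say which fix is intended. The sketch only contrasts a full scan over every task record with a lookup in a per-executor index, one common way to avoid such a scan.

```python
from collections import defaultdict


class FullScanTracker:
    """Toy model (hypothetical): each executor loss walks every task record."""

    def __init__(self):
        self.task_infos = {}  # task_id -> executor_id

    def add_task(self, task_id, executor_id):
        self.task_infos[task_id] = executor_id

    def tasks_on_lost_executor(self, executor_id):
        # Visits all N records, no matter how few ran on this executor.
        return sorted(
            tid for tid, ex in self.task_infos.items() if ex == executor_id
        )


class IndexedTracker:
    """Toy model (hypothetical): keeps an extra executor -> task ids index."""

    def __init__(self):
        self.task_infos = {}  # task_id -> executor_id
        self.tasks_by_executor = defaultdict(set)  # assumed index, not in Spark

    def add_task(self, task_id, executor_id):
        self.task_infos[task_id] = executor_id
        self.tasks_by_executor[executor_id].add(task_id)

    def tasks_on_lost_executor(self, executor_id):
        # Touches only the k tasks recorded for this executor.
        # .get() avoids creating an entry for an unknown executor.
        return sorted(self.tasks_by_executor.get(executor_id, ()))


if __name__ == "__main__":
    full, indexed = FullScanTracker(), IndexedTracker()
    for task_id in range(1000):
        executor = f"exec-{task_id % 10}"
        full.add_task(task_id, executor)
        indexed.add_task(task_id, executor)
    # Both approaches report the same affected tasks; only the work differs.
    same = full.tasks_on_lost_executor("exec-3") == indexed.tasks_on_lost_executor("exec-3")
    print(same)
```

In the full-scan model, removing E executors in one batch visits roughly E × N records in total, which is the kind of work that could keep a single-threaded event loop busy when N is in the millions. The index trades extra memory and bookkeeping on task launch for cheaper loss handling.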



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

