Dejiu Lu created SPARK-56235:
--------------------------------

             Summary: TaskSetManager.executorLost() O(N) scan over taskInfos 
causes DriverEndpoint thread stall with large task counts
                 Key: SPARK-56235
                 URL: https://issues.apache.org/jira/browse/SPARK-56235
             Project: Spark
          Issue Type: Improvement
          Components: Spark Core
    Affects Versions: 4.1.1
            Reporter: Dejiu Lu


When a stage has a very large number of tasks (e.g., 5 million), 
`TaskSetManager.executorLost()` becomes a severe bottleneck because it performs 
an O(N) full scan over the `taskInfos` HashMap. Combined with batch executor 
removal by dynamic allocation, this blocks the single-threaded `DriverEndpoint` 
RPC processor for 100+ seconds, which delays task status updates and makes the 
job appear stuck. 
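
The cost pattern described above can be illustrated with a small Python sketch. 
This is not Spark's code: `TaskInfo`, `tasks_on_executor_scan`, 
`build_executor_index` and `tasks_on_executor_indexed` are hypothetical names, 
and the per-executor index is only one possible way to avoid the scan, not 
necessarily what this issue proposes. The point is that a full scan visits all 
N entries for each lost executor, so removing K executors in a batch costs on 
the order of K x N entry visits on one thread.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class TaskInfo:
    # Hypothetical stand-in for Spark's TaskInfo; only the fields needed here.
    task_id: int
    executor_id: str


def tasks_on_executor_scan(task_infos, executor_id):
    """Full scan: visits every entry, so cost grows with the total task count."""
    return [tid for tid, info in task_infos.items()
            if info.executor_id == executor_id]


def build_executor_index(task_infos):
    """Hypothetical secondary index: executor id -> ids of tasks launched there."""
    index = defaultdict(list)
    for tid, info in task_infos.items():
        index[info.executor_id].append(tid)
    return index


def tasks_on_executor_indexed(index, executor_id):
    """Lookup touches only the tasks recorded for the lost executor."""
    return list(index.get(executor_id, []))


# Small demo: 1000 tasks spread over 10 executors (made-up numbers).
task_infos = {i: TaskInfo(i, f"exec-{i % 10}") for i in range(1000)}
index = build_executor_index(task_infos)
lost = "exec-3"
same = tasks_on_executor_scan(task_infos, lost) == tasks_on_executor_indexed(index, lost)
```

In this sketch the index has to be kept up to date as tasks are launched, which 
trades a small per-launch cost for a lookup that no longer depends on N.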



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

