Dejiu Lu created SPARK-56235:
--------------------------------
Summary: TaskSetManager.executorLost() O(N) scan over taskInfos
causes DriverEndpoint thread stall with large task counts
Key: SPARK-56235
URL: https://issues.apache.org/jira/browse/SPARK-56235
Project: Spark
Issue Type: Improvement
Components: Spark Core
Affects Versions: 4.1.1
Reporter: Dejiu Lu
When a stage has a very large number of tasks (e.g., 5 million),
`TaskSetManager.executorLost()` becomes a severe bottleneck because it performs
an O(N) full scan over the `taskInfos` HashMap. When dynamic allocation removes
executors in batches, the scan repeats for each lost executor and blocks the
single-threaded `DriverEndpoint` RPC processor for 100+ seconds. Task status
updates are delayed during that time, which makes the job appear stuck.
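
To make the cost concrete, here is a toy model in Python (not Spark's Scala code). It assumes each lost executor triggers one full pass over a task map, and it compares that with a hypothetical per-executor index built once. The names `TaskInfo`, `tasks_on_executor_scan`, `build_executor_index`, and `total_scan_visits` are illustrative only and do not correspond to Spark APIs; the index is one possible direction, not a fix stated in this issue.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskInfo:
    # Hypothetical stand-in for a task record; not Spark's TaskInfo class.
    task_id: int
    executor_id: str


def tasks_on_executor_scan(task_infos, executor_id):
    """Full scan: visits every entry, however few of them match."""
    visited = 0
    matches = []
    for task_id, info in task_infos.items():
        visited += 1
        if info.executor_id == executor_id:
            matches.append(task_id)
    return matches, visited


def build_executor_index(task_infos):
    """Hypothetical alternative: group task ids by executor once, up front."""
    index = defaultdict(list)
    for task_id, info in task_infos.items():
        index[info.executor_id].append(task_id)
    return index


def total_scan_visits(task_infos, lost_executors):
    """Entries visited when each lost executor triggers its own full scan."""
    return sum(tasks_on_executor_scan(task_infos, e)[1] for e in lost_executors)


# 1,000 tasks spread round-robin over 10 executors (toy sizes).
task_infos = {i: TaskInfo(i, f"exec-{i % 10}") for i in range(1000)}
lost = [f"exec-{n}" for n in range(10)]

print(total_scan_visits(task_infos, lost))  # 10 scans x 1000 entries = 10000
index = build_executor_index(task_infos)
print(sum(len(index[e]) for e in lost))  # 1000
```

In this model the scan-based work grows with (lost executors) x (total tasks), while the indexed lookup touches only the tasks that ran on the lost executors. Real timings depend on the per-entry work in Spark and are not reproduced here.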