[ https://issues.apache.org/jira/browse/SPARK-52752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
ASF GitHub Bot updated SPARK-52752:
-----------------------------------
    Labels: pull-request-available  (was: )

> Executor may be killed before task is finished due to DRA idle timedout
> -----------------------------------------------------------------------
>
>                 Key: SPARK-52752
>                 URL: https://issues.apache.org/jira/browse/SPARK-52752
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 3.5.0, 4.0.1
>            Reporter: Zhen Wang
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: image-2025-07-10-18-21-36-377.png, image-2025-07-10-18-21-57-192.png
>
>
> I found some failed tasks with the reason "executor 260 exited unrelated to the running tasks", but the executor log shows that the task had finished successfully.
> !image-2025-07-10-18-21-57-192.png!
> I collected the logs relevant to this issue.
> Executor 260 task scheduler logs:
> {code:java}
> 25/07/10 17:54:14 INFO Executor: Finished task 193.0 in stage 462.0 (TID 23745). 187203 bytes result sent to driver
> 25/07/10 17:55:08 INFO YarnCoarseGrainedExecutorBackend: Got assigned task 25061
> 25/07/10 17:55:08 INFO Executor: Running task 274.0 in stage 639.0 (TID 25061)
> ......
> 25/07/10 17:55:14 INFO MemoryStore: Block taskresult_25061 stored as bytes in memory (estimated size 1821.7 KiB, free 1477.0 MiB)
> 25/07/10 17:55:14 INFO Executor: Finished task 274.0 in stage 639.0 (TID 25061). 1865404 bytes result sent via BlockManager)
> 25/07/10 17:55:17 ERROR CoarseGrainedExecutorBackend: RECEIVED SIGNAL TERM
> 25/07/10 17:55:17 INFO MemoryStore: MemoryStore cleared
> 25/07/10 17:55:17 INFO BlockManager: BlockManager stopped
> 25/07/10 17:55:17 INFO ShutdownHookManager: Shutdown hook called
> {code}
> AsyncEventQueue dropped some events:
> {code:java}
> 25/07/10 17:53:15 ERROR AsyncEventQueue: Dropping event from queue eventLog. This likely means one of the listeners is too slow and cannot keep up with the rate at which tasks are being started by the scheduler.
> {code}
> The last ExecutorMonitor$Tracker log for executor 260 in the driver:
> {code:java}
> 25/07/10 17:54:16 INFO ExecutorMonitor$Tracker: Updating timeout for executor 260, delta: -1
> 25/07/10 17:54:16 INFO ExecutorMonitor$Tracker: Updating timeout for executor 260 to 100306447198500258 ns
> {code}
> Driver log showing executor 260 being killed due to idle timeout:
> {code:java}
> 25/07/10 17:55:16 INFO YarnClusterSchedulerBackend: Requesting to kill executor(s) 260, 423
> 25/07/10 17:55:16 INFO YarnClusterSchedulerBackend: Actual list of executor(s) to be killed is 260
> 25/07/10 17:55:16 INFO ApplicationMaster$AMEndpoint: Driver requested to kill executor(s) 260.
> 25/07/10 17:55:16 INFO ExecutorAllocationManager: Executors 260 removed due to idle timeout.
> {code}
>
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
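
The timestamps in the quoted logs can be lined up to check the timing behind the report. The following is a minimal sketch, not part of the issue: the event names are hypothetical labels, and it assumes the driver and executor clocks are roughly in sync. It shows that the kill request came 60 seconds after the last logged timeout update for executor 260, which matches the default value of spark.dynamicAllocation.executorIdleTimeout (the report does not state the configured value), and that task 25061 was assigned 8 seconds before the kill request.

```python
from datetime import datetime

# Spark's default log timestamp layout: yy/MM/dd HH:mm:ss
FMT = "%y/%m/%d %H:%M:%S"


def ts(text):
    """Parse a timestamp copied from the logs in the report."""
    return datetime.strptime(text, FMT)


# Event names are illustrative labels; timestamps are taken from the report.
events = {
    "last_timeout_update": ts("25/07/10 17:54:16"),  # driver, ExecutorMonitor$Tracker
    "task_assigned": ts("25/07/10 17:55:08"),        # executor, TID 25061
    "task_finished": ts("25/07/10 17:55:14"),        # executor, result sent via BlockManager
    "kill_requested": ts("25/07/10 17:55:16"),       # driver, idle timeout
    "sigterm_received": ts("25/07/10 17:55:17"),     # executor
}


def seconds_between(start, end):
    """Whole seconds elapsed from event `start` to event `end`."""
    return int((events[end] - events[start]).total_seconds())


idle_gap = seconds_between("last_timeout_update", "kill_requested")
assigned_before_kill = seconds_between("task_assigned", "kill_requested")
finished_before_sigterm = seconds_between("task_finished", "sigterm_received")

print(f"last timeout update -> kill request: {idle_gap} s")
print(f"task assigned -> kill request: {assigned_before_kill} s")
print(f"task finished -> SIGTERM: {finished_before_sigterm} s")
```

Under these assumptions, the numbers are consistent with the driver's idle timer for executor 260 not being refreshed when task 25061 started, which fits the dropped listener events shown in the AsyncEventQueue log, although the logs alone do not prove which event was dropped.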