Attila Zsolt Piros created SPARK-34361:
------------------------------------------
Summary: Dynamic allocation on K8s kills executors with running tasks
Key: SPARK-34361
URL: https://issues.apache.org/jira/browse/SPARK-34361
Project: Spark
Issue Type: Bug
Components: Kubernetes
Affects Versions: 3.0.1, 3.0.0, 3.0.2, 3.1.0, 3.2.0, 3.1.1, 3.1.2
Reporter: Attila Zsolt Piros
There is a race between the executor POD allocator and the cluster scheduler backend.
During downscaling (with dynamic allocation enabled) we saw many newly started executors killed while they still had running tasks on them.
The pattern in the logs is the following:
{noformat}
21/02/01 15:12:03 INFO ExecutorMonitor: New executor 312 has registered (new total is 138)
...
21/02/01 15:12:03 INFO TaskSetManager: Starting task 247.0 in stage 4.0 (TID 2079, 100.100.18.138, executor 312, partition 247, PROCESS_LOCAL, 8777 bytes)
21/02/01 15:12:03 INFO ExecutorPodsAllocator: Deleting 3 excess pod requests (408,312,307).
...
21/02/01 15:12:04 ERROR TaskSchedulerImpl: Lost executor 312 on 100.100.18.138: The executor with id 312 was deleted by a user or the framework.
{noformat}
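For illustration, below is a minimal, simplified Scala sketch of the interleaving (the object and method names are hypothetical and do not correspond to Spark's actual internals): the scheduler backend registers a new executor and schedules a task on it, while the pod allocator is still working from a stale snapshot in which that executor is only an outstanding pod request, so it deletes it as excess.

{code:scala}
// Hypothetical, simplified model of the race described above; these names
// are illustrative only and are not Spark's real classes.
object DownscaleRaceSketch {

  // What the pod allocator believes: these executor IDs are still only
  // outstanding pod requests and are therefore safe to delete on downscale.
  var allocatorPendingView: Set[Int] = Set(307, 312, 408)

  // What the scheduler backend knows: registered executors and the number
  // of tasks currently running on each.
  var backendRunningTasks: Map[Int, Int] = Map.empty

  // Scheduler backend path: a new pod registers and immediately gets a task.
  def registerAndSchedule(execId: Int): Unit = {
    backendRunningTasks += execId -> 1
    println(s"ExecutorMonitor: New executor $execId has registered")
    println(s"TaskSetManager: Starting task on executor $execId")
  }

  // Allocator path: dynamic allocation lowered the target, so the allocator
  // deletes the "excess" requests based only on its (now stale) pending view.
  def deleteExcessPodRequests(): Unit = {
    allocatorPendingView.foreach { execId =>
      println(s"ExecutorPodsAllocator: Deleting excess pod request $execId")
      if (backendRunningTasks.getOrElse(execId, 0) > 0) {
        println(s"TaskSchedulerImpl: Lost executor $execId while it had running tasks")
      }
    }
    allocatorPendingView = Set.empty
  }

  def main(args: Array[String]): Unit = {
    // Executor 312 registers and starts a task *between* the allocator's last
    // snapshot and its decision to delete excess pod requests.
    registerAndSchedule(312)
    deleteExcessPodRequests()
  }
}
{code}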