[ https://issues.apache.org/jira/browse/SPARK-34361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17278932#comment-17278932 ]
Attila Zsolt Piros commented on SPARK-34361:
--------------------------------------------
I am working on this.
> Dynamic allocation on K8s kills executors with running tasks
> ------------------------------------------------------------
>
> Key: SPARK-34361
> URL: https://issues.apache.org/jira/browse/SPARK-34361
> Project: Spark
> Issue Type: Bug
> Components: Kubernetes
> Affects Versions: 3.0.0, 3.0.1, 3.0.2, 3.1.0, 3.2.0, 3.1.1, 3.1.2
> Reporter: Attila Zsolt Piros
> Priority: Major
>
> There is a race between the executor POD allocator and the cluster scheduler backend.
> During downscaling (with dynamic allocation) we saw many newly created
> executors killed while they had running tasks on them.
> The pattern in the log is the following:
> {noformat}
> 21/02/01 15:12:03 INFO ExecutorMonitor: New executor 312 has registered (new total is 138)
> ...
> 21/02/01 15:12:03 INFO TaskSetManager: Starting task 247.0 in stage 4.0 (TID 2079, 100.100.18.138, executor 312, partition 247, PROCESS_LOCAL, 8777 bytes)
> 21/02/01 15:12:03 INFO ExecutorPodsAllocator: Deleting 3 excess pod requests (408,312,307).
> ...
> 21/02/01 15:12:04 ERROR TaskSchedulerImpl: Lost executor 312 on 100.100.18.138: The executor with id 312 was deleted by a user or the framework.
> {noformat}
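The event order in the log above can be replayed as a small, single-threaded model. This is a hypothetical sketch, not Spark code: the `Cluster` class, `excess_naive`, `excess_guarded`, and `replay` are illustrative names. It assumes the allocator chooses its victims from a snapshot of "pending" pod requests taken before executor 312 registered. The naive picker deletes 312 together with TID 2079. The guarded picker, one possible mitigation and not necessarily how the issue was fixed, re-checks the scheduler's registered set right before deleting.

```python
"""Hypothetical replay of the SPARK-34361 event order; not Spark code."""
from dataclasses import dataclass, field


@dataclass
class Cluster:
    registered: set = field(default_factory=set)  # executors the scheduler backend knows
    running: dict = field(default_factory=dict)   # executor id -> running task ids (TIDs)
    deleted: set = field(default_factory=set)     # executor pods removed by the allocator

    def register(self, exec_id):
        self.registered.add(exec_id)

    def launch_task(self, exec_id, tid):
        self.running.setdefault(exec_id, []).append(tid)

    def delete_pods(self, exec_ids):
        """Delete the given executor pods and return the task ids lost with them."""
        lost = []
        for e in exec_ids:
            self.deleted.add(e)
            self.registered.discard(e)
            lost.extend(self.running.pop(e, []))
        return lost


def excess_naive(pending_snapshot, n, _cluster):
    # Trusts the stale snapshot only; ignores what registered since it was taken.
    return sorted(pending_snapshot, reverse=True)[:n]


def excess_guarded(pending_snapshot, n, cluster):
    # Hypothetical guard: skip executors that have registered since the snapshot.
    still_pending = [e for e in sorted(pending_snapshot, reverse=True)
                     if e not in cluster.registered]
    return still_pending[:n]


def replay(pick):
    cluster = Cluster()
    snapshot = {307, 312, 408}       # allocator's view: all three look like pending requests
    cluster.register(312)            # "New executor 312 has registered"
    cluster.launch_task(312, 2079)   # "Starting task 247.0 ... (TID 2079, ..., executor 312, ...)"
    victims = pick(snapshot, 3, cluster)  # "Deleting 3 excess pod requests"
    return victims, cluster.delete_pods(victims)


print(replay(excess_naive))    # ([408, 312, 307], [2079])
print(replay(excess_guarded))  # ([408, 307], [])
```

In a real deployment the allocator and the scheduler backend run concurrently, so a check-then-delete like `excess_guarded` still leaves a small window between the check and the deletion. The model only shows why a decision based on a stale snapshot kills a busy executor.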
--
This message was sent by Atlassian Jira
(v8.3.4#803005)