[ https://issues.apache.org/jira/browse/SPARK-26423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

David Vogelbacher updated SPARK-26423:
--------------------------------------
    Description: 
If an executor disconnects, we currently only disable it in the 
{{KubernetesClusterSchedulerBackend}} but don't take any further action, in 
the expectation that all the other necessary actions (removing it from Spark, 
requesting a replacement executor, ...) will be driven by k8s lifecycle 
events.
However, this only works if the executor disconnected because its pod is 
dying, shutting down, or the like.
It doesn't work if there is merely a network issue between the driver and the 
executor while the executor pod keeps running in k8s.
Thus (as indicated in the TODO comment in 
[KubernetesClusterSchedulerBackend|https://github.com/apache/spark/blob/master/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/KubernetesClusterSchedulerBackend.scala#L158]),
we should make sure that a disconnected executor eventually gets killed in 
k8s.
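
One possible shape of the fix, sketched below: when the driver notices a 
disconnect, schedule a delayed check and, if the executor has not 
re-registered by then, delete its pod through the fabric8 
{{KubernetesClient}} that the k8s backend already uses. All names here 
({{DisconnectedExecutorReaper}}, {{isStillRegistered}}, the timeout) are 
hypothetical illustrations, not actual Spark internals.

{code:scala}
import java.util.concurrent.{Executors, TimeUnit}

import io.fabric8.kubernetes.client.KubernetesClient

// Hypothetical sketch, not actual Spark code: reap executor pods that
// stay disconnected past a grace period.
class DisconnectedExecutorReaper(
    client: KubernetesClient,
    namespace: String,
    timeoutSeconds: Long) {

  private val reaper = Executors.newSingleThreadScheduledExecutor()

  /** Call when the driver loses the RPC connection to an executor. */
  def onExecutorDisconnected(
      executorId: String,
      podName: String,
      isStillRegistered: String => Boolean): Unit = {
    reaper.schedule(new Runnable {
      override def run(): Unit = {
        // If the executor has not re-registered after the grace period,
        // assume it is unreachable (e.g. a network partition) and delete
        // its pod, so that the existing k8s lifecycle handling removes it
        // from Spark and requests a replacement.
        if (!isStillRegistered(executorId)) {
          client.pods().inNamespace(namespace).withName(podName).delete()
        }
      }
    }, timeoutSeconds, TimeUnit.SECONDS)
  }
}
{code}

Deleting the pod (rather than only forgetting the executor on the driver 
side) has the advantage that the resulting k8s lifecycle events drive the 
usual cleanup and replacement path, so no separate re-provisioning logic is 
needed.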




> [K8s] Make sure that disconnected executors eventually get deleted
> ------------------------------------------------------------------
>
>                 Key: SPARK-26423
>                 URL: https://issues.apache.org/jira/browse/SPARK-26423
>             Project: Spark
>          Issue Type: Bug
>          Components: Kubernetes
>    Affects Versions: 2.4.0
>            Reporter: David Vogelbacher
>            Priority: Major



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
