[ https://issues.apache.org/jira/browse/SPARK-23485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16373558#comment-16373558 ]

Stavros Kontopoulos edited comment on SPARK-23485 at 2/22/18 10:32 PM:
-----------------------------------------------------------------------

When an executor fails, all cases are covered via handleDisconnectedExecutors, which runs on a schedule at some rate and calls removeExecutor in CoarseGrainedSchedulerBackend, which in turn updates the blacklist info. When we want to launch new executors, [CoarseGrainedSchedulerBackend|https://github.com/apache/spark/blob/f41c0a93fd3913ad93e55ddbfd875229872ecc97/core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala#L172] will terminate an executor that has already started on a blacklisted node. IMHO the Kubernetes Spark scheduler should filter nodes and constrain where pods are launched, since it already knows that some nodes are not an option. For example, this could be done with the taints and tolerations [feature|https://kubernetes.io/docs/concepts/configuration/taint-and-toleration]; it also relates to [https://github.com/kubernetes/kubernetes/issues/14573] for cases where node problems appear.
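
As a rough sketch of what "constrain where pods are launched" could look like (not code from the Kubernetes backend; the helper name is hypothetical, and the fabric8 kubernetes-client model classes are assumed to be the ones the backend already uses), the executor pod spec could carry a required node anti-affinity rule built from the blacklisted host names:

{code:scala}
import scala.collection.JavaConverters._

import io.fabric8.kubernetes.api.model.{Affinity, NodeAffinity, NodeSelector,
  NodeSelectorRequirement, NodeSelectorTerm, Pod, PodBuilder}

// Hypothetical helper: attach a required node anti-affinity rule so the
// executor pod can never be scheduled onto a blacklisted node.
def excludeBlacklistedNodes(pod: Pod, blacklistedNodes: Set[String]): Pod = {
  if (blacklistedNodes.isEmpty) {
    pod
  } else {
    val requirement = new NodeSelectorRequirement()
    requirement.setKey("kubernetes.io/hostname")
    requirement.setOperator("NotIn")
    requirement.setValues(blacklistedNodes.toList.asJava)

    val term = new NodeSelectorTerm()
    term.setMatchExpressions(List(requirement).asJava)

    val selector = new NodeSelector()
    selector.setNodeSelectorTerms(List(term).asJava)

    val nodeAffinity = new NodeAffinity()
    nodeAffinity.setRequiredDuringSchedulingIgnoredDuringExecution(selector)

    val affinity = new Affinity()
    affinity.setNodeAffinity(nodeAffinity)

    new PodBuilder(pod)
      .editOrNewSpec()
        .withAffinity(affinity)
      .endSpec()
      .build()
  }
}
{code}

Node anti-affinity keeps the change entirely inside the Spark scheduler backend, whereas the taints/tolerations route above would additionally need something to taint the bad nodes.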

 


was (Author: skonto):
When an executor fails, all cases are covered via handleDisconnectedExecutors, which runs on a schedule at some rate and calls removeExecutor in CoarseGrainedSchedulerBackend, which in turn updates the blacklist info. When we want to launch new executors, [CoarseGrainedSchedulerBackend|https://github.com/apache/spark/blob/f41c0a93fd3913ad93e55ddbfd875229872ecc97/core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala#L172] will terminate an executor that has already started on a blacklisted node. IMHO the Kubernetes Spark scheduler should fail fast and constrain where pods are launched (on which nodes), since it already knows that some nodes are not an option. For example, this could be done with the taints and tolerations [feature|https://kubernetes.io/docs/concepts/configuration/taint-and-toleration]; it also relates to [https://github.com/kubernetes/kubernetes/issues/14573] for cases where node problems appear.

 

> Kubernetes should support node blacklist
> ----------------------------------------
>
>                 Key: SPARK-23485
>                 URL: https://issues.apache.org/jira/browse/SPARK-23485
>             Project: Spark
>          Issue Type: New Feature
>          Components: Kubernetes, Scheduler
>    Affects Versions: 2.3.0
>            Reporter: Imran Rashid
>            Priority: Major
>
> Spark's BlacklistTracker maintains a list of "bad nodes" which it will not 
> use for running tasks (e.g., because of bad hardware). When running in YARN,
> this blacklist is used to avoid ever allocating resources on blacklisted 
> nodes: 
> https://github.com/apache/spark/blob/e836c27ce011ca9aef822bef6320b4a7059ec343/resource-managers/yarn/src/main/scala/org/apache/spark/scheduler/cluster/YarnSchedulerBackend.scala#L128
> I'm just beginning to poke around the kubernetes code, so apologies if this 
> is incorrect -- but I didn't see any references to 
> {{scheduler.nodeBlacklist()}} in {{KubernetesClusterSchedulerBackend}} so it 
> seems this is missing.  Thought of this while looking at SPARK-19755, a 
> similar issue on mesos.
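
A minimal sketch of the YARN-style pattern applied to Kubernetes (only {{TaskSchedulerImpl.nodeBlacklist()}} is existing Spark API; the class and callback names below are hypothetical): re-read the blacklist on every allocation round and feed it into pod construction, rather than discovering the problem only after the executor registers.

{code:scala}
import org.apache.spark.scheduler.TaskSchedulerImpl

// Hypothetical sketch of a blacklist-aware allocation round for the
// Kubernetes backend; only TaskSchedulerImpl.nodeBlacklist() is real Spark
// API, the class and callback names are illustrative.
class BlacklistAwareAllocator(scheduler: TaskSchedulerImpl) {

  def allocationRound(
      executorsToRequest: Int,
      requestNewExecutorPod: Set[String] => Unit): Unit = {
    // The same source of truth that YarnSchedulerBackend forwards to the AM.
    val excludedNodes: Set[String] = scheduler.nodeBlacklist()
    (1 to executorsToRequest).foreach { _ =>
      // The callback would translate excludedNodes into a node anti-affinity
      // rule on the executor pod (see the sketch in the comment above),
      // instead of letting the pod start and then killing the executor.
      requestNewExecutorPod(excludedNodes)
    }
  }
}
{code}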


