[ https://issues.apache.org/jira/browse/SPARK-23485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16373558#comment-16373558 ]

Stavros Kontopoulos edited comment on SPARK-23485 at 2/22/18 10:29 PM:
-----------------------------------------------------------------------

When an executor fails, all cases are covered via handleDisconnectedExecutors, 
which is scheduled to run at some rate and calls removeExecutor in 
CoarseGrainedSchedulerBackend, which in turn updates the blacklist info. When we 
want to launch new executors, 
[CoarseGrainedSchedulerBackend|https://github.com/apache/spark/blob/f41c0a93fd3913ad93e55ddbfd875229872ecc97/core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala#L172]
 will terminate an executor that has already started on a blacklisted node. IMHO 
the Kubernetes Spark scheduler should fail fast and constrain which nodes pods 
are launched on, since it already knows that some nodes are not an option. For 
example, this could be done with the 
[taints and tolerations feature|https://kubernetes.io/docs/concepts/configuration/taint-and-toleration]; 
this also relates to 
[kubernetes/kubernetes#14573|https://github.com/kubernetes/kubernetes/issues/14573] 
for cases where node problems appear. A sketch of the fail-fast idea follows below.
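
To make the fail-fast idea concrete, here is a rough sketch (plain Scala with 
simplified stand-in types, not the fabric8 pod builders the backend actually 
uses; antiAffinityFor and blacklistedNodes are just illustrative names) of 
turning the driver-side blacklist into a required node anti-affinity rule for 
new executor pods:

{code:scala}
// Sketch only: turn the driver-side node blacklist into a
// "kubernetes.io/hostname NotIn [...]" requirement, so the kube-scheduler never
// places a new executor pod on a node Spark already considers bad.
// MatchExpression is a simplified stand-in for the k8s NodeSelectorRequirement,
// not the fabric8 builder API that KubernetesClusterSchedulerBackend uses.
object ExecutorPlacementSketch {

  final case class MatchExpression(key: String, operator: String, values: Seq[String])

  // Hypothetical helper: build the required anti-affinity expression, if any.
  def antiAffinityFor(blacklistedNodes: Set[String]): Option[MatchExpression] =
    if (blacklistedNodes.isEmpty) None
    else Some(MatchExpression(
      key = "kubernetes.io/hostname",  // standard per-node hostname label
      operator = "NotIn",              // exclude every blacklisted hostname
      values = blacklistedNodes.toSeq.sorted))

  def main(args: Array[String]): Unit = {
    // e.g. two nodes reported bad by BlacklistTracker
    println(antiAffinityFor(Set("node-7", "node-12")))
  }
}
{code}

Taints and tolerations express a similar constraint from the node side (the 
kubelet or an operator marks the node, and pods that do not tolerate the taint 
are kept off it), whereas an anti-affinity rule like the above would let Spark 
act on what its own BlacklistTracker already knows.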

 


was (Author: skonto):
When an executor fails, all cases are covered via handleDisconnectedExecutors, 
which is scheduled to run at some rate and calls removeExecutor in 
CoarseGrainedSchedulerBackend, which in turn updates the blacklist info. When we 
want to launch new executors, TaskSchedulerImpl will terminate an executor that 
has already started on a blacklisted node. IMHO the Kubernetes Spark scheduler 
should fail fast and constrain which nodes pods are launched on, since it 
already knows that some nodes are not an option. For example, this could be 
done with: https://kubernetes.io/docs/concepts/configuration/taint-and-toleration

 

> Kubernetes should support node blacklist
> ----------------------------------------
>
>                 Key: SPARK-23485
>                 URL: https://issues.apache.org/jira/browse/SPARK-23485
>             Project: Spark
>          Issue Type: New Feature
>          Components: Kubernetes, Scheduler
>    Affects Versions: 2.3.0
>            Reporter: Imran Rashid
>            Priority: Major
>
> Spark's BlacklistTracker maintains a list of "bad nodes" which it will not 
> use for running tasks (e.g., because of bad hardware). When running in YARN, 
> this blacklist is used to avoid ever allocating resources on blacklisted 
> nodes: 
> https://github.com/apache/spark/blob/e836c27ce011ca9aef822bef6320b4a7059ec343/resource-managers/yarn/src/main/scala/org/apache/spark/scheduler/cluster/YarnSchedulerBackend.scala#L128
> I'm just beginning to poke around the kubernetes code, so apologies if this 
> is incorrect -- but I didn't see any references to 
> {{scheduler.nodeBlacklist()}} in {{KubernetesClusterSchedulerBackend}} so it 
> seems this is missing. I thought of this while looking at SPARK-19755, a 
> similar issue on Mesos.
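
For comparison with the YARN path linked in the quoted description, here is a 
rough sketch of that request-time pattern: every executor request carries a 
fresh snapshot of scheduler.nodeBlacklist(), and the cluster manager is expected 
to avoid those nodes. The trait and case class below are simplified stand-ins, 
not Spark's actual scheduler or RequestExecutors message:

{code:scala}
// Sketch of the YARN-style pattern: snapshot the blacklist on every executor
// request so the resource manager never allocates on a known-bad node.
// Simplified stand-in types, not Spark's real scheduler or message classes.
object BlacklistAwareRequestSketch {

  // Stand-in for the scheduler side (TaskSchedulerImpl exposes nodeBlacklist()).
  trait BlacklistSource {
    def nodeBlacklist(): Set[String]
  }

  // Stand-in for the "request more executors" message sent to the cluster manager.
  final case class ExecutorRequest(requestedTotal: Int, nodeBlacklist: Set[String])

  def buildRequest(requestedTotal: Int, scheduler: BlacklistSource): ExecutorRequest =
    // Take a fresh snapshot at request time; a Kubernetes backend could use it
    // the same way, e.g. to constrain placement of the pods it is about to create.
    ExecutorRequest(requestedTotal, scheduler.nodeBlacklist())

  def main(args: Array[String]): Unit = {
    val scheduler = new BlacklistSource {
      override def nodeBlacklist(): Set[String] = Set("bad-node-a")
    }
    println(buildRequest(10, scheduler))
  }
}
{code}

A Kubernetes analogue could take the same snapshot at pod-creation time and 
translate it into placement constraints, as sketched earlier in this thread.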


