[ 
https://issues.apache.org/jira/browse/SPARK-24105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16797719#comment-16797719
 ] 

Kevin Hogeland edited comment on SPARK-24105 at 3/21/19 1:08 AM:
-----------------------------------------------------------------

[~vanzin] Why was this marked "Won't Fix"? This is a major issue.
 * There is a limited amount of resources (constrained either by a 
ResourceQuota or by the size of the cluster)
 * Drivers are scheduled before executors due to the 2-layer scheduling design
 * Drivers consume from the same pool of resources as executors, making it 
possible to consume all available resources
 * If no driver can schedule an executor, all drivers are stalled indefinitely 
(even if they time out and crash)

Starting too many drivers at the same time _will_ cause a deadlock. Any spiky 
workload is very likely to trigger this eventually, for example when a large 
number of Spark jobs are scheduled daily or hourly. We've been able to 
reproduce this easily in testing.
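
The argument in the bullets above can be illustrated with a toy model. This is only a sketch under stated assumptions, not Spark or Kubernetes code: the cluster is reduced to a fixed number of identical pod slots, every driver and executor takes exactly one slot, and timeouts or driver restarts are not modeled. The names `Cluster` and `submit_jobs` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Cluster:
    """Toy cluster with a fixed number of identical pod slots (an assumption)."""
    capacity: int
    used: int = 0

    def try_schedule(self) -> bool:
        """Place one pod if a slot is free; report whether it was placed."""
        if self.used < self.capacity:
            self.used += 1
            return True
        return False


def submit_jobs(capacity: int, num_jobs: int, executors_per_job: int) -> dict:
    """Simulate two-layer scheduling: all drivers first, then their executors."""
    cluster = Cluster(capacity)

    # Layer 1: every submitted job asks for a driver pod.
    running_drivers = sum(1 for _ in range(num_jobs) if cluster.try_schedule())

    # Layer 2: each running driver asks for executors from the same pool.
    executors = [0] * running_drivers
    for _ in range(executors_per_job):
        for d in range(running_drivers):
            if cluster.try_schedule():
                executors[d] += 1

    # A driver without any executor cannot make progress.
    stalled = sum(1 for n in executors if n == 0)
    return {
        "running_drivers": running_drivers,
        "stalled_drivers": stalled,
        "deadlocked": running_drivers > 0 and stalled == running_drivers,
    }
```

In this model, `submit_jobs(10, 3, 2)` leaves room for executors, while `submit_jobs(10, 10, 2)` fills every slot with a driver and reports `deadlocked` as true, which is the spiky-submission case described above.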



> Spark 2.3.0 on kubernetes
> -------------------------
>
>                 Key: SPARK-24105
>                 URL: https://issues.apache.org/jira/browse/SPARK-24105
>             Project: Spark
>          Issue Type: Improvement
>          Components: Kubernetes
>    Affects Versions: 2.3.0
>            Reporter: Lenin
>            Priority: Major
>
> Right now it's only possible to define node selector configurations 
> through spark.kubernetes.node.selector.[labelKey]. This gets used for both 
> driver & executor pods. Without the capability to isolate driver & executor 
> pods, the cluster can run into a livelock scenario in which a large number 
> of spark-submits causes the driver pods to fill up the cluster capacity, 
> with no room for executor pods to do any work.
>  
> To avoid this deadlock, it's required to support node selector (in future 
> affinity/anti-affinity) configuration by driver & executor.
>  
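
For reference, the single shared setting named in the description could be passed to spark-submit roughly as follows. This is a minimal sketch: the helper `node_selector_conf` and the label `pool=spark` are hypothetical and used only for illustration. Only the `spark.kubernetes.node.selector.[labelKey]` key comes from the issue itself.

```python
def node_selector_conf(labels: dict) -> list:
    """Build spark-submit --conf arguments for spark.kubernetes.node.selector.[labelKey].

    Per the issue description, the resulting selector applies to driver and
    executor pods alike; there is no per-role variant in this sketch.
    """
    args = []
    for key, value in sorted(labels.items()):
        args += ["--conf", f"spark.kubernetes.node.selector.{key}={value}"]
    return args


# Hypothetical label: schedule all Spark pods onto nodes labeled pool=spark.
conf_args = node_selector_conf({"pool": "spark"})
```

Because the same selector is applied to both pod types, it cannot by itself keep drivers from occupying the nodes that executors need, which is the limitation the issue asks to lift.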


