[ https://issues.apache.org/jira/browse/SPARK-24105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16797719#comment-16797719 ]
Kevin Hogeland edited comment on SPARK-24105 at 3/21/19 1:08 AM:
-----------------------------------------------------------------

[~vanzin] Why was this marked "Won't Fix"? This is a major issue.
* There is a limited pool of resources (constrained either by a ResourceQuota or by the size of the cluster).
* Drivers are scheduled before executors due to the two-layer scheduling design.
* Drivers consume from the same pool of resources as executors, so drivers alone can consume all available resources.
* If no driver can schedule an executor, all drivers are stalled indefinitely (even if they time out and crash).

Starting too many drivers at the same time _will_ cause a deadlock, and any spiky workload is very likely to trigger this eventually, for example when a large number of Spark jobs is scheduled daily or hourly. We've been able to reproduce this easily in testing.


> Spark 2.3.0 on kubernetes
> -------------------------
>
>                 Key: SPARK-24105
>                 URL: https://issues.apache.org/jira/browse/SPARK-24105
>             Project: Spark
>          Issue Type: Improvement
>          Components: Kubernetes
>    Affects Versions: 2.3.0
>            Reporter: Lenin
>            Priority: Major
>
> Right now it is only possible to define node selector configuration through spark.kubernetes.node.selector.[labelKey], and it is applied to both driver and executor pods. Without the ability to isolate driver pods from executor pods, the cluster can run into a livelock scenario: if there are many spark-submits, the driver pods can fill up the cluster capacity, leaving no room for executor pods to do any work.
>
> To avoid this deadlock, node selector (and in the future affinity/anti-affinity) configuration needs to be supported separately for the driver and the executors.
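For concreteness, a minimal sketch of how the single selector namespace behaves today and what per-role selectors might look like. The label key/value, cluster URL, and the driver/executor property prefixes in the comments are hypothetical, not an existing API:

{code:scala}
import org.apache.spark.SparkConf

// Sketch only: hypothetical label key/value and cluster URL.
val conf = new SparkConf()
  .setAppName("selector-demo")
  .setMaster("k8s://https://kubernetes.example.com:6443")
  // Today there is a single selector namespace; this label is attached to
  // BOTH the driver pod and every executor pod, so they compete for the
  // same nodes (and the same ResourceQuota, if one is set on the namespace).
  .set("spark.kubernetes.node.selector.spark-pool", "shared")

// What this issue asks for would look roughly like separate per-role prefixes,
// e.g. (hypothetical at the time of this issue):
//   spark.kubernetes.driver.node.selector.spark-pool   = drivers
//   spark.kubernetes.executor.node.selector.spark-pool = executors
// so driver pods could be confined to their own node pool and could not
// starve executors of cluster capacity.
{code}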