[ https://issues.apache.org/jira/browse/SPARK-23980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Marcelo Vanzin reassigned SPARK-23980:
--------------------------------------

    Assignee:     (was: Marcelo Vanzin)

> Resilient Spark driver on Kubernetes
> ------------------------------------
>
>                 Key: SPARK-23980
>                 URL: https://issues.apache.org/jira/browse/SPARK-23980
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 2.3.0
>            Reporter: Sebastian Toader
>            Priority: Major
>
> The current implementation of the Spark driver on Kubernetes is not resilient
> to node failures, because the driver is implemented as a `Pod`. When a node fails,
> Kubernetes terminates the pods that were running on that node and does not
> reschedule them onto any of the other nodes in the cluster.
> If the driver is instead implemented as a Kubernetes
> [Job|https://kubernetes.io/docs/concepts/workloads/controllers/jobs-run-to-completion/],
> then it will be rescheduled onto another node.
> When the driver is terminated, its executors (which may run on other nodes) are
> terminated by Kubernetes with some delay via [Kubernetes garbage
> collection|https://kubernetes.io/docs/concepts/workloads/controllers/garbage-collection/].
> This can lead to concurrency issues where the re-spawned driver tries
> to create new executors with the same names as executors that are still in the
> middle of being cleaned up by Kubernetes garbage collection.
> To solve this issue, the executor names must be made unique for each driver
> *instance*.
> The PR linked to this JIRA is an implementation of the above: it creates the
> Spark driver as a Job and ensures that executor pod names are unique per
> driver instance.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
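The naming scheme described in the issue can be sketched as follows. This is a minimal, hypothetical illustration, not Spark's actual implementation: the class and method names are invented, and it only shows the idea of embedding a per-driver-instance identifier in each executor pod name so a re-spawned driver never collides with pods still being garbage-collected from the previous instance.

```java
import java.util.UUID;

// Hypothetical sketch of per-driver-instance executor pod naming.
// Names here are illustrative; Spark's real code may differ.
public class ExecutorPodNames {

    // A short random id minted once per driver *instance*, so a restarted
    // driver gets a fresh id and therefore fresh executor pod names.
    public static String newDriverInstanceId() {
        return UUID.randomUUID().toString().substring(0, 8);
    }

    // Executor pod name combines the app name, the instance id, and the
    // executor's sequence number, e.g. "spark-pi-a1b2c3d4-exec-0".
    public static String executorPodName(String appName, String instanceId, int executorId) {
        return appName + "-" + instanceId + "-exec-" + executorId;
    }
}
```

With this scheme, two driver instances of the same application produce disjoint executor pod names, so executors created by the new driver cannot clash with pods awaiting garbage collection from the old one.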