Sebastian Toader created SPARK-23980:
----------------------------------------

             Summary: Resilient Spark driver on Kubernetes
                 Key: SPARK-23980
                 URL: https://issues.apache.org/jira/browse/SPARK-23980
             Project: Spark
          Issue Type: Improvement
          Components: Spark Core
    Affects Versions: 2.3.0
            Reporter: Sebastian Toader


The current implementation of the `Spark driver` on Kubernetes is not resilient to 
node failures, because the driver runs as a bare `Pod`. If a node fails, Kubernetes 
terminates the pods that were running on that node and does not reschedule them 
onto any of the other nodes of the cluster.

If the `driver` is instead run as a Kubernetes 
[Job|https://kubernetes.io/docs/concepts/workloads/controllers/jobs-run-to-completion/], 
then it will be rescheduled to another node.
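
As an illustration only (the actual spec is in the linked PR; the Job name, image and 
backoffLimit below are placeholder assumptions), a driver Job could be built with the 
fabric8 Kubernetes client roughly along these lines:

{code:scala}
// Hypothetical sketch, assuming a fabric8 kubernetes-client version where the
// batch Job model lives under io.fabric8.kubernetes.api.model.batch.
import io.fabric8.kubernetes.api.model.batch.JobBuilder

val driverJob = new JobBuilder()
  .withNewMetadata()
    .withName("spark-driver")                 // placeholder Job name
  .endMetadata()
  .withNewSpec()
    .withBackoffLimit(3)                      // let Kubernetes retry the driver pod
    .withNewTemplate()
      .withNewSpec()
        .withRestartPolicy("OnFailure")       // Jobs require OnFailure or Never
        .addNewContainer()
          .withName("spark-kubernetes-driver")
          .withImage("spark-driver:2.3.0")    // placeholder image
        .endContainer()
      .endSpec()
    .endTemplate()
  .endSpec()
  .build()
{code}

Unlike a bare driver `Pod`, the Job's controller recreates the driver pod on another 
node when the original node fails.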

When the driver is terminated, its executors (which may be running on other nodes) 
are terminated by Kubernetes with some delay via [Kubernetes garbage 
collection|https://kubernetes.io/docs/concepts/workloads/controllers/garbage-collection/].

This can lead to concurrency issues where the re-spawned `driver` tries to create 
new executors with the same names as executors that are still being cleaned up by 
Kubernetes garbage collection.

To solve this issue, executor names must be made unique for each `driver` 
*instance*.
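
A minimal sketch (hypothetical names, not the naming scheme used by the PR) of how a 
per-driver-instance suffix could be folded into executor pod names:

{code:scala}
import java.util.UUID

// Each driver instance draws a fresh random suffix at startup, so a re-spawned
// driver never reuses the pod names of executors that are still being
// garbage-collected.
val driverInstanceId: String = UUID.randomUUID().toString.takeRight(8)

def executorPodName(appId: String, executorId: Long): String =
  s"$appId-$driverInstanceId-exec-$executorId"

// e.g. executorPodName("spark-pi-1523", 1) might yield
// "spark-pi-1523-3f9c2a1b-exec-1" for the random suffix "3f9c2a1b"
{code}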

The PR linked to this JIRA implements the above: it creates the Spark driver as a 
Job and ensures that executor pod names are unique per driver instance.


