GitHub user baluchicken opened a pull request:

    https://github.com/apache/spark/pull/21067

    [SPARK-23980][K8S] Resilient Spark driver on Kubernetes

    ## What changes were proposed in this pull request?
    
    The current implementation of the Spark `driver` on Kubernetes is not 
resilient to node failures, as the driver is implemented as a bare `Pod`. In 
case of a node failure, Kubernetes terminates the pods that were running on 
that node and does not reschedule them to any of the other nodes of the cluster.
    If the `driver` is instead implemented as a Kubernetes `Job`, it is 
rescheduled to another node.
    When the driver is terminated, its executors (which may run on other nodes) 
are terminated by Kubernetes garbage collection with some delay.
    This can lead to concurrency issues where the re-spawned `driver` tries to 
create new executors with the same names as executors that are still in the 
middle of being cleaned up by garbage collection.
    To solve this issue, the executor name is made unique for each `driver` 
instance.
    For example: 
    `networkwordcount-1519301591265-usmj-exec-1`
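    The naming scheme can be sketched roughly as follows. This is a hedged 
illustration, not the code from the patch: the `ExecutorNames` object and its 
members are hypothetical names, and only the resulting format 
(`appName-launchTime-<suffix>-exec-<id>`) is taken from the example above.

    ```scala
    import scala.util.Random

    object ExecutorNames {
      // A short random suffix generated once per driver instance, so a
      // re-spawned driver never reuses the executor names of a previous
      // instance whose pods are still being garbage collected.
      private val instanceSuffix: String =
        Random.alphanumeric.filter(_.isLetter).map(_.toLower).take(4).mkString

      // Builds a pod name like: networkwordcount-1519301591265-usmj-exec-1
      def executorPodName(appName: String, launchTime: Long, executorId: Int): String =
        s"$appName-$launchTime-$instanceSuffix-exec-$executorId"
    }
    ```

    Because the suffix is fixed for the lifetime of one driver instance, all 
executors of that instance share it, while a rescheduled driver draws a fresh 
one.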
    
    ## How was this patch tested?
    
    This patch was tested manually.
    Submitted a Spark application to a cluster with three nodes:
    
    ```
    kubectl get jobs
    NAME                                    DESIRED   SUCCESSFUL   AGE
    networkwordcount-1519301591265-driver   1         0            3m
    ```
    
    ```
    kubectl get pods
    NAME                                          READY     STATUS    RESTARTS   AGE
    networkwordcount-1519301591265-driver-mszl2   1/1       Running   0          3m
    networkwordcount-1519301591265-usmj-exec-1    1/1       Running   0          1m
    networkwordcount-1519301591265-usmj-exec-2    1/1       Running   0          1m
    ```
    
    Spark driver `networkwordcount-1519301591265-driver` is a Kubernetes Job 
that manages the `networkwordcount-1519301591265-driver-mszl2` pod.
    
    Shut down the node where the driver pod was running:
    
    ```
    kubectl get pods
    NAME                                          READY     STATUS    RESTARTS   AGE
    networkwordcount-1519301591265-driver-dwvkf   1/1       Running   0          3m
    networkwordcount-1519301591265-rmes-exec-1    1/1       Running   0          1m
    networkwordcount-1519301591265-rmes-exec-2    1/1       Running   0          1m
    ```
    
    The Spark driver Kubernetes Job rescheduled the driver pod as 
`networkwordcount-1519301591265-driver-dwvkf`.
    Please review http://spark.apache.org/contributing.html before opening a 
pull request.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/banzaicloud/spark SPARK-23980

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/21067.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #21067
    
----
commit 0ac938e9e942023b0a6dfb0c2ffdd7dc543e5084
Author: Balint Molnar <balintmolnar91@...>
Date:   2018-04-13T12:22:00Z

    [SPARK-23980][K8S] Resilient Spark driver on Kubernetes

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
