GitHub user baluchicken opened a pull request:
https://github.com/apache/spark/pull/21067
[SPARK-23980][K8S] Resilient Spark driver on Kubernetes
## What changes were proposed in this pull request?
The current implementation of the `Spark driver` on Kubernetes is not resilient
to node failures because it is implemented as a `Pod`. When a node fails,
Kubernetes terminates the pods that were running on that node and does not
reschedule them to any of the other nodes of the cluster.
If the `driver` is implemented as a Kubernetes Job, then it will be
rescheduled to another node.
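As a rough illustration of what wrapping the driver in a Job means, the sketch below builds a `batch/v1` Job manifest as a plain Python dict. It is not the code in this patch: the helper name `driver_job_manifest`, the container name, the image and the arguments are all assumptions made for the example. Only the Job name is taken from the test run in this description.

```python
def driver_job_manifest(job_name, image, driver_args):
    """Return a batch/v1 Job manifest (as a dict) wrapping one driver container.

    Hypothetical helper for illustration; the patch itself may build the Job
    differently.
    """
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": job_name},
        "spec": {
            "template": {
                "spec": {
                    # A Job's pod template only accepts "Never" or "OnFailure".
                    "restartPolicy": "Never",
                    "containers": [
                        {
                            "name": "spark-kubernetes-driver",  # assumed name
                            "image": image,
                            "args": list(driver_args),
                        }
                    ],
                }
            }
        },
    }


manifest = driver_job_manifest(
    "networkwordcount-1519301591265-driver",
    "example.invalid/spark:latest",  # hypothetical image
    ["driver"],  # hypothetical arguments
)
```

Because the pod is owned by the Job rather than created directly, the Job controller creates a replacement pod when the original one is lost together with its node.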
When the driver is terminated, its executors (which may run on other nodes)
are terminated by Kubernetes garbage collection with some delay.
This can lead to concurrency issues where the re-spawned `driver`
tries to create new executors with the same names as executors that are still
being cleaned up by Kubernetes garbage collection.
To solve this issue, the executor name must be made unique for each `driver`
instance.
For example:
`networkwordcount-1519301591265-usmj-exec-1`
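A minimal sketch of this naming scheme, assuming the four-letter segment (`usmj` above) is a random tag chosen once per driver instance. The helper names `new_driver_instance_id` and `executor_pod_name` are hypothetical, and the way the patch actually generates the tag is not shown in this description.

```python
import random
import string


def new_driver_instance_id(length=4, rng=random):
    """Return a short random lowercase tag such as 'usmj' (assumed format)."""
    return "".join(rng.choice(string.ascii_lowercase) for _ in range(length))


def executor_pod_name(app_prefix, driver_instance_id, executor_id):
    """Build '<app_prefix>-<driver_instance_id>-exec-<executor_id>'."""
    return f"{app_prefix}-{driver_instance_id}-exec-{executor_id}"


# Two driver instances of the same application; the instance ids are the ones
# that appear in the test output of this description.
first = executor_pod_name("networkwordcount-1519301591265", "usmj", 1)
second = executor_pod_name("networkwordcount-1519301591265", "rmes", 1)
```

With different instance ids, executor 1 of the re-spawned driver no longer shares a name with executor 1 of the terminated driver, so creating it does not conflict with a pod that is still being garbage collected.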
## How was this patch tested?
This patch was tested manually.
Submitted a Spark application to a cluster with three nodes:
```
kubectl get jobs
NAME                                    DESIRED   SUCCESSFUL   AGE
networkwordcount-1519301591265-driver   1         0            3m
```
```
kubectl get pods
NAME                                          READY   STATUS    RESTARTS   AGE
networkwordcount-1519301591265-driver-mszl2   1/1     Running   0          3m
networkwordcount-1519301591265-usmj-exec-1    1/1     Running   0          1m
networkwordcount-1519301591265-usmj-exec-2    1/1     Running   0          1m
```
The Spark driver `networkwordcount-1519301591265-driver` is a Kubernetes Job
that manages the `networkwordcount-1519301591265-driver-mszl2` pod.
Shut down the node where the driver pod was running:
```
kubectl get pods
NAME                                          READY   STATUS    RESTARTS   AGE
networkwordcount-1519301591265-driver-dwvkf   1/1     Running   0          3m
networkwordcount-1519301591265-rmes-exec-1    1/1     Running   0          1m
networkwordcount-1519301591265-rmes-exec-2    1/1     Running   0          1m
```
The Spark driver Kubernetes Job rescheduled the driver pod as
`networkwordcount-1519301591265-driver-dwvkf`.
Please review http://spark.apache.org/contributing.html before opening a
pull request.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/banzaicloud/spark SPARK-23980
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/21067.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #21067
----
commit 0ac938e9e942023b0a6dfb0c2ffdd7dc543e5084
Author: Balint Molnar <balintmolnar91@...>
Date: 2018-04-13T12:22:00Z
[SPARK-23980][K8S] Resilient Spark driver on Kubernetes
----