Github user mccheah commented on a diff in the pull request:
https://github.com/apache/spark/pull/21067#discussion_r194558081
--- Diff:
resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/KubernetesClusterSchedulerBackend.scala
---
@@ -59,16 +59,18 @@ private[spark] class KubernetesClusterSchedulerBackend(
   private val kubernetesNamespace = conf.get(KUBERNETES_NAMESPACE)
-  private val kubernetesDriverPodName = conf
-    .get(KUBERNETES_DRIVER_POD_NAME)
-    .getOrElse(throw new SparkException("Must specify the driver pod name"))
+  private val kubernetesDriverJobName = conf
+    .get(KUBERNETES_DRIVER_JOB_NAME)
+    .getOrElse(throw new SparkException("Must specify the driver job name"))
   private implicit val requestExecutorContext = ExecutionContext.fromExecutorService(
     requestExecutorsService)
-  private val driverPod = kubernetesClient.pods()
-    .inNamespace(kubernetesNamespace)
-    .withName(kubernetesDriverPodName)
-    .get()
+  private val driverPod: Pod = {
+    val pods = kubernetesClient.pods()
+      .inNamespace(kubernetesNamespace).withLabel("job-name", kubernetesDriverJobName).list()
--- End diff ---
> Pod fails: Job will recreate a new Driver Pod to replace the failed one.
> There will be only one Driver pod because the failed one will be removed by
> the Kubernetes garbage collector.

For this one I'm not sure the garbage collection happens immediately. If that
is part of the Kubernetes contract, then we're fine; if not, then we can't
make that assumption here.
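
One way to avoid depending on GC timing would be to filter out terminated pods
when resolving the driver pod from the label selector. A rough sketch only
(assuming the fabric8 client calls already used in the diff; `resolveDriverPod`
is just an illustrative helper name, not something in this PR):

```scala
import scala.collection.JavaConverters._

import io.fabric8.kubernetes.api.model.Pod

import org.apache.spark.SparkException

// Illustrative sketch: resolve the driver pod via the Job's "job-name" label
// without assuming a failed predecessor has already been garbage collected.
private def resolveDriverPod(): Pod = {
  val pods = kubernetesClient.pods()
    .inNamespace(kubernetesNamespace)
    .withLabel("job-name", kubernetesDriverJobName)
    .list()
    .getItems
    .asScala
  // Ignore pods that have already terminated, e.g. a failed driver pod that
  // the Job controller replaced but that has not been cleaned up yet.
  val livePods = pods.filterNot { pod =>
    val phase = pod.getStatus.getPhase
    phase == "Failed" || phase == "Succeeded"
  }
  if (livePods.size != 1) {
    throw new SparkException(
      s"Expected exactly one live driver pod for job $kubernetesDriverJobName, " +
        s"found ${livePods.size}")
  }
  livePods.head
}
```

Even with that filter there is still a window where the replacement pod has not
been created yet, so an empty list would also need to be handled (retry or fail
fast) rather than assuming the list is non-empty.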
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]