Oz Ben-Ami created SPARK-24122:
----------------------------------
Summary: Allow automatic driver restarts on K8s
Key: SPARK-24122
URL: https://issues.apache.org/jira/browse/SPARK-24122
Project: Spark
Issue Type: Improvement
Components: Kubernetes
Affects Versions: 2.3.0
Reporter: Oz Ben-Ami
[~foxish]
Right now SparkSubmit creates the driver as a bare pod, rather than as a managed
controller like a Deployment or a StatefulSet. This means there is no way to
guarantee automatic restarts, e.g. in case a node has an issue. Note that a Pod's
restartPolicy does not apply if the node itself fails. A StatefulSet would let us
guarantee that, while keeping the ability for executors to find the driver via DNS.
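As a rough sketch (all names, the port, and the image are hypothetical placeholders, not anything SparkSubmit generates today), the driver could be run as a single-replica StatefulSet behind a headless Service, which gives the driver pod a stable DNS name that executors can resolve:

```yaml
# Hypothetical sketch: a headless Service plus a one-replica StatefulSet,
# so the driver pod gets a stable DNS record
# (spark-driver-0.spark-driver.<namespace>.svc.cluster.local)
# and is recreated by the controller if its node fails.
apiVersion: v1
kind: Service
metadata:
  name: spark-driver           # hypothetical name
spec:
  clusterIP: None              # headless: stable per-pod DNS, no load balancing
  selector:
    app: spark-driver
  ports:
  - name: driver-rpc
    port: 7078                 # placeholder driver RPC port
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: spark-driver
spec:
  serviceName: spark-driver    # ties pod DNS to the headless Service above
  replicas: 1                  # exactly one driver at a time
  selector:
    matchLabels:
      app: spark-driver
  template:
    metadata:
      labels:
        app: spark-driver
    spec:
      restartPolicy: Always    # the only policy StatefulSets allow
      containers:
      - name: spark-driver
        image: <spark-driver-image>   # placeholder
```

With {{replicas: 1}} the controller guarantees at most one driver pod exists, and reschedules it elsewhere if its node is lost.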
This is particularly helpful for long-running streaming workloads, where we
currently rely on {{yarn.resourcemanager.am.max-attempts}} with YARN. I can
confirm that Spark Streaming and Structured Streaming applications can be made
to recover from such a restart with the help of checkpointing. The executors
will have to be re-launched by the driver, but this should not be a problem.
For batch processing, we could alternatively use Kubernetes {{Job}} objects,
which restart pods on failure but not on success. For example, note the
semantics provided by the {{kubectl run}}
[command|https://kubernetes.io/docs/reference/generated/kubectl/kubectl-commands#run]
* {{--restart=Never}}: bare Pod
* {{--restart=Always}}: Deployment
* {{--restart=OnFailure}}: Job
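For the batch case, the {{--restart=OnFailure}} semantics above correspond to a manifest roughly like the following (the name, retry budget, and image are hypothetical placeholders):

```yaml
# Hypothetical sketch: driver pod wrapped in a Job for batch workloads.
# restartPolicy: OnFailure restarts the pod if the driver exits non-zero;
# the Job is marked complete once the driver exits 0.
apiVersion: batch/v1
kind: Job
metadata:
  name: spark-batch-driver     # hypothetical name
spec:
  backoffLimit: 4              # retries before the Job is marked failed
  template:
    spec:
      restartPolicy: OnFailure # re-run on failure, stop on success
      containers:
      - name: spark-driver
        image: <spark-driver-image>   # placeholder
```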
https://github.com/apache-spark-on-k8s/spark/issues/288
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]