Github user foxish commented on a diff in the pull request:
https://github.com/apache/spark/pull/19946#discussion_r156683182
--- Diff: docs/running-on-kubernetes.md ---
@@ -0,0 +1,498 @@
+---
+layout: global
+title: Running Spark on Kubernetes
+---
+* This will become a table of contents (this text will be scraped).
+{:toc}
+
+Spark can run on clusters managed by [Kubernetes](https://kubernetes.io).
This feature makes use of the new experimental native
+Kubernetes scheduler that has been added to Spark.
+
+# Prerequisites
+
+* A runnable distribution of Spark 2.3 or above.
+* A running Kubernetes cluster at version >= 1.6 with access configured to
it using
+[kubectl](https://kubernetes.io/docs/user-guide/prereqs/). If you do not
already have a working Kubernetes cluster,
+you may set up a test cluster on your local machine using
+[minikube](https://kubernetes.io/docs/getting-started-guides/minikube/).
+ * We recommend using the latest release of minikube with the DNS addon enabled.
+* You must have appropriate permissions to list, create, edit and delete
+[pods](https://kubernetes.io/docs/user-guide/pods/) in your cluster. You can verify that you have these permissions
+by running `kubectl auth can-i <list|create|edit|delete> pods`, as illustrated in the example after this list.
+ * The service account credentials used by the driver pods must be
allowed to create pods, services and configmaps.
+* You must have [Kubernetes
DNS](https://kubernetes.io/docs/concepts/services-networking/dns-pod-service/)
configured in your cluster.
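+
+For example, a quick check of the permissions required for submission, using a hypothetical `spark` service account in
+the `default` namespace, might look like the following:
+
+```bash
+# Can the submitting user create and delete pods in the target namespace?
+kubectl auth can-i create pods --namespace default
+kubectl auth can-i delete pods --namespace default
+
+# Can the driver's service account create pods, services and configmaps?
+kubectl auth can-i create pods --namespace default --as system:serviceaccount:default:spark
+kubectl auth can-i create services --namespace default --as system:serviceaccount:default:spark
+kubectl auth can-i create configmaps --namespace default --as system:serviceaccount:default:spark
+```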
+
+# How it works
+
+<p style="text-align: center;">
+ <img src="img/k8s-cluster-mode.png" title="Spark cluster components"
alt="Spark cluster components" />
+</p>
+
+`spark-submit` can be directly used to submit a Spark application to a Kubernetes cluster. The submission mechanism works as follows:
+
+* Spark creates a Spark driver running within a [Kubernetes pod](https://kubernetes.io/docs/concepts/workloads/pods/pod/).
+* The driver creates executors, which also run within Kubernetes pods, connects to them, and executes application code.
+* When the application completes, the executor pods terminate and are cleaned up, but the driver pod persists
+logs and remains in "completed" state in the Kubernetes API until it's eventually garbage collected or manually cleaned up.
+
+Note that in the completed state, the driver pod does *not* use any
computational or memory resources.
+
+The driver and executor pod scheduling is handled by Kubernetes. It will
be possible to affect Kubernetes scheduling
+decisions for driver and executor pods using advanced primitives like
+[node
selectors](https://kubernetes.io/docs/concepts/configuration/assign-pod-node/#nodeselector)
+and [node/pod
affinities](https://kubernetes.io/docs/concepts/configuration/assign-pod-node/#affinity-and-anti-affinity)
+in a future release.
+
+# Submitting Applications to Kubernetes
+
+## Docker Images
+
+Kubernetes requires users to supply images that can be deployed into
containers within pods. The images are built to
+be run in a container runtime environment that Kubernetes supports. Docker
is a container runtime environment that is
+frequently used with Kubernetes. With Spark 2.3, Dockerfiles are provided in the runnable distribution that can be
+customized and built for your use.
+
+You may build these Docker images from sources.
+There is a script, `sbin/build-push-docker-images.sh`, that you can use to build and push
+customized Spark distribution images consisting of all the above components.
+
+Example usage is:
+
+    ./sbin/build-push-docker-images.sh -r <repo> -t my-tag build
+    ./sbin/build-push-docker-images.sh -r <repo> -t my-tag push
+
+Dockerfiles are under the `dockerfiles/` directory and can be customized further before
+building, either using the supplied script or manually. A rough sketch of a manual build is shown below.
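+
+As an illustration, a manual build and push of a customized driver image could look like the following; the Dockerfile
+path shown here is an assumption and may differ in your distribution:
+
+```bash
+# Build a driver image from a Dockerfile shipped with the distribution
+# (the path below is illustrative) and push it to your registry.
+docker build -t <repo>/spark-driver:my-tag -f dockerfiles/driver/Dockerfile .
+docker push <repo>/spark-driver:my-tag
+```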
+
+## Cluster Mode
+
+To launch Spark Pi in cluster mode,
+
+{% highlight bash %}
+$ bin/spark-submit \
+ --deploy-mode cluster \
+ --class org.apache.spark.examples.SparkPi \
+ --master k8s://https://<k8s-apiserver-host>:<k8s-apiserver-port> \
+ --conf spark.kubernetes.namespace=default \
+ --conf spark.executor.instances=5 \
+ --conf spark.app.name=spark-pi \
+ --conf spark.kubernetes.driver.docker.image=<driver-image> \
+ --conf spark.kubernetes.executor.docker.image=<executor-image> \
+ local:///opt/spark/examples/jars/spark-examples_2.11-2.3.0.jar
+{% endhighlight %}
+
+The Spark master, specified either via passing the `--master` command line
argument to `spark-submit` or by setting
+`spark.master` in the application's configuration, must be a URL with the
format `k8s://<api_server_url>`. Prefixing the
+master string with `k8s://` will cause the Spark application to launch on
the Kubernetes cluster, with the API server
+being contacted at `api_server_url`. If no HTTP protocol is specified in
the URL, it defaults to `https`. For example,
+setting the master to `k8s://example.com:443` is equivalent to setting it
to `k8s://https://example.com:443`, but to
+connect without TLS on a different port, the master would be set to
`k8s://http://example.com:8080`.
+
+If you have a Kubernetes cluster setup, one way to discover the apiserver
URL is by executing `kubectl cluster-info`.
+
+```bash
+kubectl cluster-info
+Kubernetes master is running at http://127.0.0.1:6443
+```
+
+In the above example, the specific Kubernetes cluster can be used with spark-submit by specifying
+`--master k8s://http://127.0.0.1:6443` as an argument. Additionally, it is also possible to use the
+authenticating proxy, `kubectl proxy`, to communicate with the Kubernetes API.
+
+The local proxy can be started by:
+
+```bash
+ kubectl proxy
+```
+
+If the local proxy is running at localhost:8001, `--master k8s://http://127.0.0.1:8001` can be used as the argument to
+spark-submit, as in the sketch below. Finally, notice that in the above example we specify a jar with a URI whose
+scheme is `local://`. This URI is the location of the example jar that is already in the Docker image.
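+
+As a minimal sketch, reusing the images from the earlier example and assuming the proxy is listening on its default
+port 8001, the submission could look like:
+
+```bash
+bin/spark-submit \
+  --deploy-mode cluster \
+  --class org.apache.spark.examples.SparkPi \
+  --master k8s://http://127.0.0.1:8001 \
+  --conf spark.executor.instances=2 \
+  --conf spark.kubernetes.driver.docker.image=<driver-image> \
+  --conf spark.kubernetes.executor.docker.image=<executor-image> \
+  local:///opt/spark/examples/jars/spark-examples_2.11-2.3.0.jar
+```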
+
+## Dependency Management
+
+If your application's dependencies are all hosted in remote locations like
HDFS or http servers, they may be referred to
+by their appropriate remote URIs. Also, application dependencies can be
pre-mounted into custom-built Docker images.
+Those dependencies can be added to the classpath by referencing them with
`local://` URIs and/or setting the
+`SPARK_EXTRA_CLASSPATH` environment variable in your Dockerfiles.
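+
+For example, a submission that fetches the main application jar from HDFS and references an additional dependency that
+has been pre-mounted into the images could look like the following; the class name, jar names and paths here are
+placeholders:
+
+```bash
+bin/spark-submit \
+  --deploy-mode cluster \
+  --class com.example.MyApp \
+  --master k8s://https://<k8s-apiserver-host>:<k8s-apiserver-port> \
+  --conf spark.kubernetes.driver.docker.image=<driver-image> \
+  --conf spark.kubernetes.executor.docker.image=<executor-image> \
+  --jars local:///opt/spark/jars/my-app-deps.jar \
+  hdfs://<namenode>/path/to/my-app.jar
+```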
+
+## Introspection and Debugging
+
+These are the different ways in which you can investigate a
running/completed Spark application, monitor progress, and
+take actions.
+
+### Accessing Logs
+
+Logs can be accessed using the Kubernetes API and the `kubectl` CLI. When a Spark application is running, it's possible
+to stream logs from the application using:
+
+```bash
+kubectl -n=<namespace> logs -f <driver-pod-name>
+```
+
+The same logs can also be accessed through the
+[Kubernetes dashboard](https://kubernetes.io/docs/tasks/access-application-cluster/web-ui-dashboard/) if installed on
+the cluster.
+
+### Accessing Driver UI
+
+The UI associated with any application can be accessed locally using
+[`kubectl
port-forward`](https://kubernetes.io/docs/tasks/access-application-cluster/port-forward-access-application-cluster/#forward-a-local-port-to-a-port-on-the-pod).
+
+```bash
+kubectl port-forward <driver-pod-name> 4040:4040
+```
+
+Then, the Spark driver UI can be accessed at `http://localhost:4040`.
+
+### Debugging
+
+There may be several kinds of failures. If the Kubernetes API server
rejects the request made from spark-submit, or the
+connection is refused for a different reason, the submission logic should indicate the error encountered. However, if there
+are errors while the application is running, often the best way to investigate is through the Kubernetes CLI.
+
+To get some basic information about the scheduling decisions made around
the driver pod, you can run:
+
+```bash
+kubectl describe pod <spark-driver-pod>
+```
+
+If the pod has encountered a runtime error, the status can be probed
further using:
+
+```bash
+kubectl logs <spark-driver-pod>
+```
+
+Status and logs of failed executor pods can be checked in similar ways. Finally, deleting the driver pod will clean up
+the entire Spark application, including all executors, the associated service, etc. The driver pod can be thought of as
+the Kubernetes representation of the Spark application.
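+
+For example, a completed or misbehaving application can be removed entirely by deleting its driver pod:
+
+```bash
+kubectl -n=<namespace> delete pod <spark-driver-pod>
+```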
+
+## Kubernetes Features
+
+### Namespaces
+
+Kubernetes has the concept of
[namespaces](https://kubernetes.io/docs/concepts/overview/working-with-objects/namespaces/).
+Namespaces are ways to divide cluster resources between multiple users (via resource quota). Spark on Kubernetes can
+use namespaces to launch Spark applications. The namespace is selected through the `spark.kubernetes.namespace`
+configuration, e.g. by passing `--conf spark.kubernetes.namespace=<namespace>` to spark-submit.
+
+Kubernetes allows using [ResourceQuota](https://kubernetes.io/docs/concepts/policy/resource-quotas/) to set limits on
+resources, number of objects, etc. on individual namespaces. Namespaces and ResourceQuota can be used in combination by
+administrators to control sharing and resource allocation in a Kubernetes cluster running Spark applications.
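+
+As a minimal sketch, assuming a dedicated `spark-jobs` namespace (the namespace name and quota values are
+illustrative), an administrator could do:
+
+```bash
+# Create a namespace dedicated to Spark applications.
+kubectl create namespace spark-jobs
+
+# Cap the total CPU, memory and pod count available to that namespace.
+kubectl create quota spark-quota --namespace=spark-jobs \
+  --hard=requests.cpu=20,requests.memory=64Gi,pods=50
+
+# Applications are then submitted with:
+#   --conf spark.kubernetes.namespace=spark-jobs
+```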
+
+### RBAC
+
+In Kubernetes clusters with
[RBAC](https://kubernetes.io/docs/admin/authorization/rbac/) enabled, users can
configure
+Kubernetes RBAC roles and service accounts used by the various Spark on
Kubernetes components to access the Kubernetes
+API server.
+
+The Spark driver pod uses a Kubernetes service account to access the
Kubernetes API server to create and watch executor
+pods. The service account used by the driver pod must have the appropriate
permission for the driver to be able to do
+its work. Specifically, at minimum, the service account must be granted a
+[`Role` or
`ClusterRole`](https://kubernetes.io/docs/admin/authorization/rbac/#role-and-clusterrole)
that allows driver
+pods to create pods and services. By default, the driver pod is
automatically assigned the `default` service account in
+the namespace specified by `spark.kubernetes.namespace`, if no service
account is specified when the pod gets created.
+
+Depending on the version and setup of Kubernetes deployed, this `default`
service account may or may not have the role
+that allows driver pods to create pods and services under the default
Kubernetes
+[RBAC](https://kubernetes.io/docs/admin/authorization/rbac/) policies.
Sometimes users may need to specify a custom
+service account that has the right role granted. Spark on Kubernetes
supports specifying a custom service account to
+be used by the driver pod through the configuration property
+`spark.kubernetes.authenticate.driver.serviceAccountName=<service account name>`. For example, to make the driver pod
+use the `spark` service account, a user simply adds the following option to the `spark-submit` command:
+
+```
+--conf spark.kubernetes.authenticate.driver.serviceAccountName=spark
+```
+
+To create a custom service account, a user can use the `kubectl create
serviceaccount` command. For example, the
+following command creates a service account named `spark`:
+
+```bash
+kubectl create serviceaccount spark
+```
+
+To grant a service account a `Role` or `ClusterRole`, a `RoleBinding` or
`ClusterRoleBinding` is needed. To create
+a `RoleBinding` or `ClusterRoleBinding`, a user can use the `kubectl
create rolebinding` (or `clusterrolebinding`
for `ClusterRoleBinding`) command. For example, the following command creates a `ClusterRoleBinding` that grants the
+built-in `edit` `ClusterRole` to the `spark` service account created above in the `default` namespace:
+
+```bash
+kubectl create clusterrolebinding spark-role --clusterrole=edit
--serviceaccount=default:spark --namespace=default
+```
+
+Note that a `Role` can only be used to grant access to resources (like
pods) within a single namespace, whereas a
+`ClusterRole` can be used to grant access to cluster-scoped resources
(like nodes) as well as namespaced resources
+(like pods) across all namespaces. For Spark on Kubernetes, since the
driver always creates executor pods in the
+same namespace, a `Role` is sufficient, although users may use a
`ClusterRole` instead. For more information on
+RBAC authorization and how to configure Kubernetes service accounts for
pods, please refer to
+[Using RBAC
Authorization](https://kubernetes.io/docs/admin/authorization/rbac/) and
+[Configure Service Accounts for
Pods](https://kubernetes.io/docs/tasks/configure-pod-container/configure-service-account/).
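+
+As a rough sketch of the namespaced alternative (the role and binding names are illustrative, and the verb list is an
+assumption about what the driver needs), one could grant access only within a single namespace:
+
+```bash
+# A namespaced Role covering the resources the driver manages.
+kubectl create role spark-driver-role --namespace=default \
+  --verb=get,list,watch,create,delete --resource=pods,services,configmaps
+
+# Bind it to the spark service account in the same namespace.
+kubectl create rolebinding spark-driver-role-binding --namespace=default \
+  --role=spark-driver-role --serviceaccount=default:spark
+```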
+
+## Client Mode
+
+Client mode is not currently supported.
+
+## Future Work
+
+There are several Spark on Kubernetes features that are currently being
incubated in a fork -
+[apache-spark-on-k8s/spark](https://github.com/apache-spark-on-k8s/spark),
which are expected to eventually make it into
+future versions of the spark-kubernetes integration.
+
+Some of these include:
+
+* PySpark
+* R
+* Dynamic Executor Scaling
+* Local File Dependency Management
+* Spark Application Management
+* Job Queues and Resource Management
+
+You can refer to the
[documentation](https://apache-spark-on-k8s.github.io/userdocs/) if you want to
try these features
+and provide feedback to the development team.
+
+# Configuration
+
+See the [configuration page](configuration.html) for information on Spark
configurations. The following configurations are
+specific to Spark on Kubernetes.
+
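+For example, several of the properties documented below can be combined on the spark-submit command line; the label,
+annotation and node selector values in this sketch are placeholders:
+
+```bash
+bin/spark-submit \
+  --deploy-mode cluster \
+  --class org.apache.spark.examples.SparkPi \
+  --master k8s://https://<k8s-apiserver-host>:<k8s-apiserver-port> \
+  --conf spark.kubernetes.driver.docker.image=<driver-image> \
+  --conf spark.kubernetes.executor.docker.image=<executor-image> \
+  --conf spark.kubernetes.driver.label.team=data-eng \
+  --conf spark.kubernetes.executor.label.team=data-eng \
+  --conf spark.kubernetes.driver.annotation.owner=alice \
+  --conf spark.kubernetes.node.selector.disktype=ssd \
+  local:///opt/spark/examples/jars/spark-examples_2.11-2.3.0.jar
+```
+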
+#### Spark Properties
+
+<table class="table">
+<tr><th>Property Name</th><th>Default</th><th>Meaning</th></tr>
+<tr>
+ <td><code>spark.kubernetes.namespace</code></td>
+ <td><code>default</code></td>
+ <td>
+ The namespace that will be used for running the driver and executor
pods.
+ </td>
+</tr>
+<tr>
+ <td><code>spark.kubernetes.driver.docker.image</code></td>
+ <td><code>(none)</code></td>
+ <td>
+ Docker image to use for the driver. Specify this using the standard
+ <a
href="https://docs.docker.com/engine/reference/commandline/tag/">Docker tag</a>
format.
+ </td>
+</tr>
+<tr>
+ <td><code>spark.kubernetes.executor.docker.image</code></td>
+ <td><code>(none)</code></td>
+ <td>
+ Docker image to use for the executors. Specify this using the standard
+ <a
href="https://docs.docker.com/engine/reference/commandline/tag/">Docker tag</a>
format.
+ </td>
+</tr>
+<tr>
+ <td><code>spark.kubernetes.allocation.batch.size</code></td>
+ <td><code>5</code></td>
+ <td>
+ Number of pods to launch at once in each round of executor pod
allocation.
+ </td>
+</tr>
+<tr>
+ <td><code>spark.kubernetes.allocation.batch.delay</code></td>
+ <td><code>1</code></td>
+ <td>
+ Number of seconds to wait between each round of executor pod
allocation.
+ </td>
+</tr>
+<tr>
+ <td><code>spark.kubernetes.authenticate.submission.caCertFile</code></td>
+ <td>(none)</td>
+ <td>
+ Path to the CA cert file for connecting to the Kubernetes API server
over TLS when starting the driver. This file
+ must be located on the submitting machine's disk. Specify this as a
path as opposed to a URI (i.e. do not provide
+ a scheme).
+ </td>
+</tr>
+<tr>
+
<td><code>spark.kubernetes.authenticate.submission.clientKeyFile</code></td>
+ <td>(none)</td>
+ <td>
+ Path to the client key file for authenticating against the Kubernetes
API server when starting the driver. This file
+ must be located on the submitting machine's disk. Specify this as a
path as opposed to a URI (i.e. do not provide
+ a scheme).
+ </td>
+</tr>
+<tr>
+
<td><code>spark.kubernetes.authenticate.submission.clientCertFile</code></td>
+ <td>(none)</td>
+ <td>
+ Path to the client cert file for authenticating against the Kubernetes
API server when starting the driver. This
+ file must be located on the submitting machine's disk. Specify this as
a path as opposed to a URI (i.e. do not
+ provide a scheme).
+ </td>
+</tr>
+<tr>
+ <td><code>spark.kubernetes.authenticate.submission.oauthToken</code></td>
+ <td>(none)</td>
+ <td>
+ OAuth token to use when authenticating against the Kubernetes API
server when starting the driver. Note
+ that unlike the other authentication options, this is expected to be
the exact string value of the token to use for
+ the authentication.
+ </td>
+</tr>
+<tr>
+ <td><code>spark.kubernetes.authenticate.driver.caCertFile</code></td>
+ <td>(none)</td>
+ <td>
+ Path to the CA cert file for connecting to the Kubernetes API server
over TLS from the driver pod when requesting
+ executors. This file must be located on the submitting machine's disk,
and will be uploaded to the driver pod.
+ Specify this as a path as opposed to a URI (i.e. do not provide a
scheme).
+ </td>
+</tr>
+<tr>
+ <td><code>spark.kubernetes.authenticate.driver.clientKeyFile</code></td>
+ <td>(none)</td>
+ <td>
+ Path to the client key file for authenticating against the Kubernetes
API server from the driver pod when requesting
+ executors. This file must be located on the submitting machine's disk,
and will be uploaded to the driver pod.
+ Specify this as a path as opposed to a URI (i.e. do not provide a
scheme). If this is specified, it is highly
+ recommended to set up TLS for the driver submission server, as this
value is sensitive information that would be
+ passed to the driver pod in plaintext otherwise.
+ </td>
+</tr>
+<tr>
+ <td><code>spark.kubernetes.authenticate.driver.clientCertFile</code></td>
+ <td>(none)</td>
+ <td>
+ Path to the client cert file for authenticating against the Kubernetes
API server from the driver pod when
+ requesting executors. This file must be located on the submitting
machine's disk, and will be uploaded to the
+ driver pod. Specify this as a path as opposed to a URI (i.e. do not
provide a scheme).
+ </td>
+</tr>
+<tr>
+ <td><code>spark.kubernetes.authenticate.driver.oauthToken</code></td>
+ <td>(none)</td>
+ <td>
+    OAuth token to use when authenticating against the Kubernetes API server from the driver pod when
+ requesting executors. Note that unlike the other authentication
options, this must be the exact string value of
+ the token to use for the authentication. This token value is uploaded
to the driver pod. If this is specified, it is
+ highly recommended to set up TLS for the driver submission server, as
this value is sensitive information that would
+ be passed to the driver pod in plaintext otherwise.
+ </td>
+</tr>
+<tr>
+
<td><code>spark.kubernetes.authenticate.driver.serviceAccountName</code></td>
+ <td><code>default</code></td>
+ <td>
+ Service account that is used when running the driver pod. The driver
pod uses this service account when requesting
+ executor pods from the API server. Note that this cannot be specified
alongside a CA cert file, client key file,
+ client cert file, and/or OAuth token.
+ </td>
+</tr>
+<tr>
+ <td><code>spark.kubernetes.driver.label.[LabelName]</code></td>
+ <td>(none)</td>
+ <td>
+ Add the label specified by <code>LabelName</code> to the driver pod.
+ For example, <code>spark.kubernetes.driver.label.something=true</code>.
+ Note that Spark also adds its own labels to the driver pod
+ for bookkeeping purposes.
+ </td>
+</tr>
+<tr>
+ <td><code>spark.kubernetes.driver.annotation.[AnnotationName]</code></td>
+ <td>(none)</td>
+ <td>
+ Add the annotation specified by <code>AnnotationName</code> to the
driver pod.
+ For example,
<code>spark.kubernetes.driver.annotation.something=true</code>.
+ </td>
+</tr>
+<tr>
+ <td><code>spark.kubernetes.executor.label.[LabelName]</code></td>
+ <td>(none)</td>
+ <td>
+ Add the label specified by <code>LabelName</code> to the executor pods.
+ For example,
<code>spark.kubernetes.executor.label.something=true</code>.
+    Note that Spark also adds its own labels to the executor pods
+ for bookkeeping purposes.
+ </td>
+</tr>
+<tr>
+
<td><code>spark.kubernetes.executor.annotation.[AnnotationName]</code></td>
+ <td>(none)</td>
+ <td>
+ Add the annotation specified by <code>AnnotationName</code> to the
executor pods.
+ For example,
<code>spark.kubernetes.executor.annotation.something=true</code>.
+ </td>
+</tr>
+<tr>
+ <td><code>spark.kubernetes.driver.pod.name</code></td>
+ <td>(none)</td>
+ <td>
+    Name of the driver pod. If not set, the driver pod name is set to <code>spark.app.name</code> suffixed by the
+    current timestamp to avoid name conflicts.
+ </td>
+</tr>
+<tr>
+ <td><code>spark.kubernetes.submission.waitAppCompletion</code></td>
+ <td><code>true</code></td>
+ <td>
+ In cluster mode, whether to wait for the application to finish before
exiting the launcher process. When changed to
+ false, the launcher has a "fire-and-forget" behavior when launching
the Spark job.
+ </td>
+</tr>
+<tr>
+ <td><code>spark.kubernetes.report.interval</code></td>
+ <td><code>1s</code></td>
+ <td>
+ Interval between reports of the current Spark job status in cluster
mode.
+ </td>
+</tr>
+<tr>
+ <td><code>spark.kubernetes.docker.image.pullPolicy</code></td>
+ <td><code>IfNotPresent</code></td>
+ <td>
+ Docker image pull policy used when pulling Docker images with
Kubernetes.
+ </td>
+</tr>
+<tr>
+ <td><code>spark.kubernetes.driver.limit.cores</code></td>
+ <td>(none)</td>
+ <td>
+      Specify the hard CPU limit for the driver pod.
+ </td>
+ </tr>
+ <tr>
+ <td><code>spark.kubernetes.executor.limit.cores</code></td>
+ <td>(none)</td>
+ <td>
+      Specify the hard CPU limit for a single executor pod.
+ </td>
+ </tr>
+ <tr>
+ <td><code>spark.kubernetes.node.selector.[labelKey]</code></td>
+ <td>(none)</td>
+ <td>
+ Adds to the node selector of the driver pod and executor pods, with
key <code>labelKey</code> and the value as the
+ configuration's value. For example, setting
<code>spark.kubernetes.node.selector.identifier</code> to
<code>myIdentifier</code>
+ will result in the driver pod and executors having a node selector
with key <code>identifier</code> and value
+ <code>myIdentifier</code>. Multiple node selector keys can be added
by setting multiple configurations with this prefix.
+ </td>
+ </tr>
+ <tr>
+
<td><code>spark.kubernetes.driverEnv.[EnvironmentVariableName]</code></td>
+ <td>(none)</td>
+ <td>
+ Add the environment variable specified by
<code>EnvironmentVariableName</code> to
+ the Driver process. The user can specify multiple of these to set
multiple environment variables.
+ </td>
+ </tr>
+ <tr>
+ <td><code>spark.kubernetes.driver.secrets.[SecretName]</code></td>
+ <td>(none)</td>
+ <td>
+ Mounts the Kubernetes secret named <code>SecretName</code> onto the
path specified by the value
--- End diff --
Done
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]