Github user foxish commented on a diff in the pull request:

https://github.com/apache/spark/pull/19946#discussion_r157111248

--- Diff: docs/running-on-kubernetes.md ---
@@ -0,0 +1,502 @@
+---
+layout: global
+title: Running Spark on Kubernetes
+---
+* This will become a table of contents (this text will be scraped).
+{:toc}
+
+Spark can run on clusters managed by [Kubernetes](https://kubernetes.io). This feature makes use of native
+Kubernetes scheduler that has been added to Spark.
+
+# Prerequisites
+
+* A runnable distribution of Spark 2.3 or above.
+* A running Kubernetes cluster at version >= 1.6 with access configured to it using
+[kubectl](https://kubernetes.io/docs/user-guide/prereqs/). If you do not already have a working Kubernetes cluster,
+you may setup a test cluster on your local machine using
+[minikube](https://kubernetes.io/docs/getting-started-guides/minikube/).
+  * We recommend using the latest release of minikube with the DNS addon enabled.
+* You must have appropriate permissions to list, create, edit and delete
+[pods](https://kubernetes.io/docs/user-guide/pods/) in your cluster. You can verify that you can list these resources
+by running `kubectl auth can-i <list|create|edit|delete> pods`.
+  * The service account credentials used by the driver pods must be allowed to create pods, services and configmaps.
+* You must have [Kubernetes DNS](https://kubernetes.io/docs/concepts/services-networking/dns-pod-service/) configured in your cluster.
+
+# How it works
+
+<p style="text-align: center;">
+  <img src="img/k8s-cluster-mode.png" title="Spark cluster components" alt="Spark cluster components" />
+</p>
+
+<code>spark-submit</code> can be directly used to submit a Spark application to a Kubernetes cluster.
+The submission mechanism works as follows:
+
+* Spark creates a Spark driver running within a [Kubernetes pod](https://kubernetes.io/docs/concepts/workloads/pods/pod/).
+* The driver creates executors which are also running within Kubernetes pods and connects to them, and executes application code.
+* When the application completes, the executor pods terminate and are cleaned up, but the driver pod persists
+logs and remains in "completed" state in the Kubernetes API until it's eventually garbage collected or manually cleaned up.
+
+Note that in the completed state, the driver pod does *not* use any computational or memory resources.
+
+The driver and executor pod scheduling is handled by Kubernetes. It will be possible to affect Kubernetes scheduling
+decisions for driver and executor pods using advanced primitives like
+[node selectors](https://kubernetes.io/docs/concepts/configuration/assign-pod-node/#nodeselector)
+and [node/pod affinities](https://kubernetes.io/docs/concepts/configuration/assign-pod-node/#affinity-and-anti-affinity)
+in a future release.
+
+# Submitting Applications to Kubernetes
+
+## Docker Images
+
+Kubernetes requires users to supply images that can be deployed into containers within pods. The images are built to
+be run in a container runtime environment that Kubernetes supports. Docker is a container runtime environment that is
--- End diff --

I see your point. Although there is flexibility in theory, for now it's safe to assume that *most* people run Docker containers when using k8s, which makes the name docker much more intuitive. If other runtimes gain traction in the future (and we do some testing around them), we can rename `docker.image` to `container.image`. For now, I can make the documentation clearer that Spark on k8s only supports Docker images. Does that sound reasonable?