+1 (non-binding)

On Tue, Aug 15, 2017 at 10:19 PM, Shubham Chopra <shubham.cho...@gmail.com> wrote:
+1 (non-binding)

~Shubham.

On Tue, Aug 15, 2017 at 2:11 PM, Erik Erlandson <eerla...@redhat.com> wrote:

Kubernetes has evolved into an important container orchestration platform; it has a large and growing user base and an active ecosystem. Users of Apache Spark who are also deploying applications on Kubernetes (or are planning to) will have convergence-related motivations for migrating their Spark applications to Kubernetes as well. It avoids the need to deploy separate cluster infrastructure for Spark workloads and allows Spark applications to take full advantage of inhabiting the same orchestration environment as other applications. In this respect, native Kubernetes support for Spark represents a way to optimize uptake and retention of Apache Spark among the members of the expanding Kubernetes community.

On Tue, Aug 15, 2017 at 8:43 AM, Erik Erlandson <eerla...@redhat.com> wrote:

+1 (non-binding)

On Tue, Aug 15, 2017 at 8:32 AM, Anirudh Ramanathan <fox...@google.com> wrote:

The Spark on Kubernetes effort has been developed separately in a fork, and is linked back from the Apache Spark project as an experimental backend <http://spark.apache.org/docs/latest/cluster-overview.html#cluster-manager-types>. We're ~6 months in and have had 5 releases <https://github.com/apache-spark-on-k8s/spark/releases>.

- 2 Spark versions maintained (2.1 and 2.2)
- Extensive integration testing and refactoring efforts to maintain code quality
- Developer <https://github.com/apache-spark-on-k8s/spark#getting-started> and user-facing <https://apache-spark-on-k8s.github.io/userdocs/> documentation
- 10+ consistent code contributors from different organizations <https://apache-spark-on-k8s.github.io/userdocs/contribute.html#project-contributions> actively maintaining and using the project, with several more members involved in testing and providing feedback
- Several community talks on Spark on Kubernetes, generating lots of feedback from users
- In addition to these, we've seen efforts spin off, such as:
  - HDFS on Kubernetes <https://github.com/apache-spark-on-k8s/kubernetes-HDFS> with Locality and Performance Experiments
  - Kerberized access <https://docs.google.com/document/d/1RBnXD9jMDjGonOdKJ2bA1lN4AAV_1RwpU_ewFuCNWKg/edit> to HDFS from Spark running on Kubernetes

Following the SPIP process, I'm putting this SPIP up for a vote.

- +1: Yeah, let's go forward and implement the SPIP.
- +0: Don't really care.
- -1: I don't think this is a good idea because of the following technical reasons.

If any further clarification is desired on the design or the implementation, please feel free to ask questions or provide feedback.
SPIP: Kubernetes as a Native Cluster Manager

Full Design Doc: link <https://issues.apache.org/jira/secure/attachment/12881586/SPARK-18278%20Spark%20on%20Kubernetes%20Design%20Proposal%20Revision%202%20%281%29.pdf>

JIRA: https://issues.apache.org/jira/browse/SPARK-18278

Kubernetes Issue: https://github.com/kubernetes/kubernetes/issues/34377

Authors: Yinan Li, Anirudh Ramanathan, Erik Erlandson, Andrew Ash, Matt Cheah, Ilan Filonenko, Sean Suchter, Kimoon Kim

Background and Motivation

Containerization and cluster management technologies are constantly evolving in the cluster computing world. Apache Spark currently implements support for Apache Hadoop YARN and Apache Mesos, in addition to providing its own standalone cluster manager. In 2014, Google announced development of Kubernetes <https://kubernetes.io/>, which has its own unique feature set and differentiates itself from YARN and Mesos. Since its debut, it has seen contributions from over 1300 contributors with over 50000 commits. Kubernetes has cemented itself as a core player in the cluster computing world, and cloud-computing platforms such as Google Container Engine, Google Compute Engine, Amazon Web Services, and Microsoft Azure support running Kubernetes clusters.

This document outlines a proposal for integrating Apache Spark with Kubernetes in a first-class way, adding Kubernetes to the list of cluster managers that Spark can be used with. Doing so would allow users to share their computing resources and containerization framework between their existing applications on Kubernetes and their computational Spark applications. Although there is existing support for running a Spark standalone cluster on Kubernetes <https://github.com/kubernetes/examples/blob/master/staging/spark/README.md>, there are still major advantages and significant interest in having native execution support. For example, this integration provides better support for multi-tenancy and dynamic resource allocation. It also allows users to run applications of different Spark versions of their choice in the same cluster.

The feature is being developed in a separate fork <https://github.com/apache-spark-on-k8s/spark> in order to minimize risk to the main project during development. Since the start of development in November 2016, it has received over 100 commits from over 20 contributors and supports two releases based on Spark 2.1 and 2.2, respectively. Documentation is also being actively worked on, both in the main project repository and in the repository https://github.com/apache-spark-on-k8s/userdocs. Regarding real-world use cases, we have seen cluster setups that use 1000+ cores. We are also seeing growing interest in this project from more and more organizations.

While it is easy to bootstrap the project in a forked repository, it is hard to maintain in the long run because of the tricky process of rebasing onto upstream and the lack of awareness in the larger Spark community. It would benefit both the Spark and Kubernetes communities to see this feature merged upstream.
On one hand, it gives Spark users the option of running their Spark workloads alongside other workloads that may already be running on Kubernetes, enabling better resource sharing and isolation, and better cluster administration. On the other hand, it gives Kubernetes a leap forward in the area of large-scale data processing by making it an officially supported cluster manager for Spark. The risk of merging into upstream is low because most of the changes are purely incremental, i.e., new Kubernetes-aware implementations of existing interfaces/classes in Spark core are introduced. The development is also concentrated in a single place at resource-managers/kubernetes <https://github.com/apache-spark-on-k8s/spark/tree/branch-2.2-kubernetes/resource-managers/kubernetes>. The risk is further reduced by a comprehensive integration test framework, and an active and responsive community of future maintainers.

Target Personas

DevOps engineers, data scientists, data engineers, application developers, and anyone else who can benefit from having Kubernetes <https://kubernetes.io/docs/concepts/overview/what-is-kubernetes/> as a native cluster manager for Spark.

Goals

- Make Kubernetes a first-class cluster manager for Spark, alongside Spark Standalone, YARN, and Mesos.
- Support both client and cluster deployment modes.
- Support dynamic resource allocation <http://spark.apache.org/docs/latest/job-scheduling.html#dynamic-resource-allocation>.
- Support Spark Java/Scala, PySpark, and Spark R applications.
- Support secure HDFS access.
- Allow running applications of different Spark versions in the same cluster through the ability to specify the driver and executor Docker images on a per-application basis.
- Support specification and enforcement of limits on both CPU cores and memory.

Non-Goals

- Support cluster resource scheduling and sharing beyond the capabilities offered natively by the Kubernetes per-namespace resource quota model.

Proposed API Changes

Most API changes are purely incremental, i.e., new Kubernetes-aware implementations of existing interfaces/classes in Spark core are introduced. Detailed changes are as follows; a rough sketch of how the new scheduler backend plugs into the existing interfaces appears after this list.

- A new cluster manager option, KUBERNETES, is introduced, and some changes are made to SparkSubmit to make it aware of this option.
- A new implementation of CoarseGrainedSchedulerBackend, namely KubernetesClusterSchedulerBackend, is responsible for managing the creation and deletion of executor Pods through the Kubernetes API.
- A new implementation of TaskSchedulerImpl, namely KubernetesTaskSchedulerImpl, and a new implementation of TaskSetManager, namely KubernetesTaskSetManager, are introduced for Kubernetes-aware task scheduling.
- When dynamic resource allocation is enabled, a new implementation of ExternalShuffleService, namely KubernetesExternalShuffleService, is introduced.
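To make the shape of these incremental changes concrete, here is a minimal, illustrative Scala skeleton of how a Kubernetes-aware scheduler backend could slot into the existing CoarseGrainedSchedulerBackend extension points (doRequestTotalExecutors and doKillExecutors, described further under Design Sketch below). The class and method names come from this proposal; the package name, constructor arguments, and method signatures follow Spark 2.2's internals as we understand them and are assumptions for illustration only, not the fork's actual code, which is larger and lives under resource-managers/kubernetes.

    package org.apache.spark.scheduler.cluster.kubernetes

    import scala.concurrent.Future

    import org.apache.spark.rpc.RpcEnv
    import org.apache.spark.scheduler.TaskSchedulerImpl
    import org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend

    // Illustrative skeleton only: the real backend also carries a Kubernetes
    // client, the kubernetes-pod-allocator thread, and configuration.
    private[spark] class KubernetesClusterSchedulerBackend(
        scheduler: TaskSchedulerImpl,
        rpcEnv: RpcEnv)
      extends CoarseGrainedSchedulerBackend(scheduler, rpcEnv) {

      // Called by Spark core when the desired total number of executors changes
      // (e.g., under dynamic resource allocation). A real implementation would
      // hand the new target to the pod-allocator thread here.
      override def doRequestTotalExecutors(requestedTotal: Int): Future[Boolean] = {
        Future.successful(true)
      }

      // Called by Spark core to remove specific executors. A real implementation
      // would delete the corresponding executor Pods through the Kubernetes API.
      override def doKillExecutors(executorIds: Seq[String]): Future[Boolean] = {
        Future.successful(true)
      }
    }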
Design Sketch

Below we briefly describe the design. For more details on the design and architecture, please refer to the architecture documentation <https://github.com/apache-spark-on-k8s/spark/tree/branch-2.2-kubernetes/resource-managers/kubernetes/architecture-docs>.

The main idea of this design is to run the Spark driver and executors inside Kubernetes Pods <https://kubernetes.io/docs/concepts/workloads/pods/pod/>. Pods are a co-located and co-scheduled group of one or more containers run in a shared context. The driver is responsible for creating and destroying executor Pods through the Kubernetes API, while Kubernetes is fully responsible for scheduling the Pods to run on available nodes in the cluster. In cluster mode, the driver also runs in a Pod in the cluster, created through the Kubernetes API by a Kubernetes-aware submission client called by the spark-submit script. Because the driver runs in a Pod, it is reachable by the executors in the cluster using its Pod IP. In client mode, the driver runs outside the cluster and calls the Kubernetes API to create and destroy executor Pods. The driver must be routable from within the cluster for the executors to communicate with it.

The main component running in the driver is the KubernetesClusterSchedulerBackend, an implementation of CoarseGrainedSchedulerBackend, which manages allocating and destroying executors via the Kubernetes API, as instructed by Spark core via calls to the methods doRequestTotalExecutors and doKillExecutors, respectively. Within the KubernetesClusterSchedulerBackend, a separate kubernetes-pod-allocator thread handles the creation of new executor Pods with appropriate throttling and monitoring. Throttling is achieved using a feedback loop that decides whether to submit new requests for executors based on whether previous executor Pod creation requests have completed. This indirection is necessary because the Kubernetes API server accepts requests for new Pods optimistically, with the anticipation of being able to eventually schedule them to run. However, it is undesirable to have a very large number of Pods that cannot be scheduled and stay pending within the cluster. The throttling mechanism gives us control over how fast an application scales up (which can be configured), and helps prevent Spark applications from DoS-ing the Kubernetes API server with too many Pod creation requests; a self-contained sketch of such a feedback loop is given below. The executor Pods simply run the CoarseGrainedExecutorBackend class from a pre-built Docker image that contains a Spark distribution.
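The following self-contained Scala sketch shows the kind of feedback loop the kubernetes-pod-allocator thread could run: it submits the next batch of executor Pod requests only once the Pods from the previous batch are no longer pending. Everything here is illustrative; ExecutorPodAllocator, countPendingExecutorPods, and requestExecutorPods are hypothetical stand-ins for the fork's internals and for calls to a Kubernetes client, and the batch size and polling interval are arbitrary defaults.

    import java.util.concurrent.{Executors, TimeUnit}

    // A minimal sketch of the pod-allocator feedback loop, assuming hypothetical
    // helpers backed by a Kubernetes client:
    //   countPendingExecutorPods(): executor Pods requested but not yet running
    //   requestExecutorPods(n):     asks the API server to create n executor Pods
    class ExecutorPodAllocator(
        countPendingExecutorPods: () => Int,
        requestExecutorPods: Int => Unit,
        podsPerBatch: Int = 5,
        intervalSeconds: Long = 1L) {

      // Desired total number of executors, updated by the scheduler backend.
      @volatile private var targetTotal = 0
      @volatile private var requestedSoFar = 0

      private val ticker = Executors.newSingleThreadScheduledExecutor()

      def setTargetTotalExecutors(total: Int): Unit = { targetTotal = total }

      def start(): Unit = {
        val allocate: Runnable = new Runnable {
          override def run(): Unit = {
            // Feedback: only ask for more Pods once the previous batch has been
            // scheduled, so a large target cannot flood the API server.
            if (countPendingExecutorPods() == 0 && requestedSoFar < targetTotal) {
              val batch = math.min(podsPerBatch, targetTotal - requestedSoFar)
              requestExecutorPods(batch)
              requestedSoFar += batch
            }
          }
        }
        ticker.scheduleWithFixedDelay(allocate, 0L, intervalSeconds, TimeUnit.SECONDS)
      }

      def stop(): Unit = ticker.shutdownNow()
    }

In this sketch, a driver-side backend would call setTargetTotalExecutors from doRequestTotalExecutors and let the loop converge toward the target at a configurable pace.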
There are auxiliary and optional components, the ResourceStagingServer and the KubernetesExternalShuffleService, which serve specific purposes described below. The ResourceStagingServer serves as a file store (in the absence of a persistent storage layer in Kubernetes) for application dependencies uploaded from the submission client machine, which then get downloaded from the server by the init-containers in the driver and executor Pods. It is a Jetty server with JAX-RS and has two endpoints, for uploading and downloading files, respectively. Security tokens are returned in the responses for file uploads and must be carried in the requests for downloading the files. The ResourceStagingServer is deployed as a Kubernetes Service <https://kubernetes.io/docs/concepts/services-networking/service/> backed by a Deployment <https://kubernetes.io/docs/concepts/workloads/controllers/deployment/> in the cluster, and multiple instances may be deployed in the same cluster. Spark applications specify which ResourceStagingServer instance to use through a configuration property.

The KubernetesExternalShuffleService is used to support dynamic resource allocation, with which the number of executors of a Spark application can change at runtime based on its resource needs. It provides an additional endpoint for drivers that allows the shuffle service to detect driver termination and clean up the shuffle files associated with the corresponding application. There are two ways of deploying the KubernetesExternalShuffleService: running a shuffle service Pod on each node in the cluster, or on a subset of the nodes, using a DaemonSet <https://kubernetes.io/docs/concepts/workloads/controllers/daemonset/>; or running a shuffle service container in each of the executor Pods. In the first option, each shuffle service container mounts a hostPath <https://kubernetes.io/docs/concepts/storage/volumes/#hostpath> volume. The same hostPath volume is also mounted by each of the executor containers, which must also have the environment variable SPARK_LOCAL_DIRS point to the hostPath. In the second option, a shuffle service container is co-located with an executor container in each of the executor Pods. The two containers share an emptyDir <https://kubernetes.io/docs/concepts/storage/volumes/#emptydir> volume where the shuffle data gets written. There may be multiple instances of the shuffle service deployed in a cluster, which may be used for different versions of Spark, or for different priority levels with different resource quotas.

New Kubernetes-specific configuration options are also introduced to facilitate specification and customization of driver and executor Pods and related Kubernetes resources. For example, driver and executor Pods can be created in a particular Kubernetes namespace and on a particular set of nodes in the cluster. Users are allowed to apply labels and annotations to the driver and executor Pods. An illustrative configuration example is sketched below.

Additionally, secure HDFS support is being actively worked on following the design here <https://docs.google.com/document/d/1RBnXD9jMDjGonOdKJ2bA1lN4AAV_1RwpU_ewFuCNWKg/edit>. Both short-running jobs and long-running jobs that need periodic delegation token refresh are supported, leveraging built-in Kubernetes constructs like Secrets. Please refer to the design doc for details.
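As a rough illustration of the configuration surface described above, the Scala snippet below sketches how an application might select Kubernetes as the cluster manager and customize its driver and executor Pods. Only SparkConf and the dynamic-allocation/shuffle-service properties are standard Spark; the k8s:// master URL scheme and the spark.kubernetes.* property names are drawn from the fork's user documentation as best recalled and should be treated as assumptions that may differ between releases, and the host, namespace, and image names are placeholders. In practice these values are typically passed to spark-submit as --conf flags rather than set in code.

    import org.apache.spark.SparkConf

    // Sketch of a Kubernetes-targeted configuration; property names are illustrative.
    object KubernetesConfExample {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("spark-on-k8s-example")
          // The k8s:// scheme points Spark at the Kubernetes API server.
          .setMaster("k8s://https://kubernetes.example.com:6443")
          // Kubernetes-specific options (illustrative names; consult the fork's user docs).
          .set("spark.kubernetes.namespace", "spark-jobs")
          .set("spark.kubernetes.driver.docker.image", "registry.example.com/spark-driver:v2.2.0")
          .set("spark.kubernetes.executor.docker.image", "registry.example.com/spark-executor:v2.2.0")
          // A label applied to the driver Pod (illustrative property naming).
          .set("spark.kubernetes.driver.label.team", "data-eng")
          // Standard Spark settings that pair with the Kubernetes external shuffle service.
          .set("spark.dynamicAllocation.enabled", "true")
          .set("spark.shuffle.service.enabled", "true")

        // Print the effective configuration for inspection.
        conf.getAll.sortBy(_._1).foreach { case (k, v) => println(s"$k=$v") }
      }
    }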
Rejected Designs

Resource Staging by the Driver

A first implementation effectively included the ResourceStagingServer in the driver container itself. The driver container ran a custom command that opened an HTTP endpoint and waited for the submission client to send resources to it. The server would then run the driver code after it had received the resources from the submission client machine. The problem with this approach is that the submission client needs to deploy the driver in such a way that the driver itself is reachable from outside of the cluster, but it is difficult for an automated framework which is not aware of the cluster's configuration to expose an arbitrary Pod in a generic way. The Service-based design chosen allows a cluster administrator to expose the ResourceStagingServer in a manner that makes sense for their cluster, such as with an Ingress or with a NodePort service.

Kubernetes External Shuffle Service

Several alternatives were considered for the design of the shuffle service. The first design postulated the use of long-lived executor Pods with sidecar containers in them running the shuffle service. The advantage of this model was that it would let us use emptyDir for sharing as opposed to using node-local storage, which guarantees better lifecycle management of storage by Kubernetes. The apparent disadvantage was that it would be a departure from the traditional Spark methodology of keeping executors for only as long as required in dynamic allocation mode. It would additionally use up more resources than strictly necessary during the course of long-running jobs, partially losing the advantage of dynamic scaling.

Another alternative considered was to use a separate shuffle service manager as a nameserver. This design has a few drawbacks. First, it means another component that needs authentication/authorization management and maintenance. Second, this separate component needs to be kept in sync with the Kubernetes cluster. Last but not least, most of the functionality of this separate component can be performed by a combination of the in-cluster shuffle service and the Kubernetes API server.

Pluggable Scheduler Backends

Fully pluggable scheduler backends were considered as a more generalized solution, and remain interesting as a possible avenue for future-proofing against new scheduling targets. For the purposes of this project, adding a new specialized scheduler backend for Kubernetes was chosen as the approach due to its very low impact on the core Spark code; making the scheduler fully pluggable would be a high-impact, high-risk modification to Spark's core libraries. The pluggable scheduler backends effort is being tracked in SPARK-19700 <https://issues.apache.org/jira/browse/SPARK-19700>.