On Mon, Mar 4, 2019 at 8:23 AM Xiangrui Meng <men...@gmail.com> wrote:
>
> On Mon, Mar 4, 2019 at 7:24 AM Sean Owen <sro...@gmail.com> wrote:
>
>> To be clear, those goals sound fine to me. I don't think voting on
>> those two broad points is meaningful, but does no harm per se. If you
>> mean this is just a check to see if people believe this is broadly
>> worthwhile, then +1 from me.
>
> Yes it is.
>
>> That means we'd want to review something more detailed later, whether
>> it's a) a design doc we vote on or b) a series of pull requests. Given
>> the number of questions this leaves open, a) sounds better and I think
>> what you're suggesting. I'd call that the SPIP, but, so what, it's
>> just a name. The thing is, a) seems already mostly done, in the second
>> document that was attached.
>
> It is far from done. We still need to review the APIs and the design for
> each major component:
>
> * Internal changes to the Spark job scheduler.
> * Interfaces exposed to users.
> * Interfaces exposed to cluster managers.
> * Standalone / auto-discovery.
> * YARN
> * K8s
> * Mesos
> * Jenkins
>
> I try to avoid discussing each of them in this thread because they require
> different domain experts. After we have a high-level agreement on adding
> accelerator support to Spark, we can kick off the work in parallel. If any
> committer thinks follow-up work still needs an SPIP, we just follow the
> SPIP process to resolve it.
>
>> I'm hesitating because I'm not sure why
>> it's important to not discuss that level of detail here, as it's
>> already available. Just too much noise?
>
> Yes. If we go down one or two levels, we might have to pull in different
> domain experts for different questions.
>
>> but voting for this seems like
>> endorsing those decisions, as I can only assume the proposer is going
>> to continue the design with those decisions in mind.
>
> That is certainly not the purpose, which was why there were two docs, not
> just one SPIP. The purpose of the companion doc is just to give some
> concrete stories and estimate what could be done in Spark 3.0. Maybe we
> should update the SPIP doc and make it clear that certain features are
> pending follow-up discussions.
>
>> What's the next step in your view, after this, and before it's
>> implemented? As long as there is one, sure, let's punt. Seems like we
>> could begin that conversation nowish.
>
> We should assign each major component an "owner" who can lead the
> follow-up work, e.g.,
>
> * Internal changes to the Spark scheduler
> * Interfaces to cluster managers and users
> * Standalone support
> * YARN support
> * K8s support
> * Mesos support
> * Test infrastructure
> * FPGA
>
> Again, for each component the question we should answer first is "Is it
> important?" and then "How to implement it?". Community members who are
> interested in each discussion should subscribe to the corresponding JIRA.
> If some committer thinks we need a follow-up SPIP, either to make more
> members aware of the changes or to reach agreement, feel free to call it
> out.
>
>> Many of those questions you list are _fine_ for a SPIP, in my opinion.
>> (Of course, I'd add which cluster managers are in/out of scope.)
>
> I think the two that require more discussion are Mesos and K8s. Let me
> follow what I suggested above and try to answer the two questions for each:
>
> Mesos:
> * Is it important? There are certainly Spark/Mesos users, but the overall
> usage is going downhill. See the attached Google Trends snapshot.
> [image: Screen Shot 2019-03-04 at 8.10.50 AM.png]
> * How to implement it? I believe it is doable, similar to the other
> cluster managers. However, we need to find someone from our community to
> do the work. If we cannot find such a person, it would indicate that the
> feature is not that important.
>
> K8s:
> * Is it important? K8s is the fastest-growing cluster manager, but the
> current Spark support is experimental. Building features on top of it
> would add additional cost if we want to make changes.
> * How to implement it? There is a sketch in the companion doc. Yinan
> mentioned three options for exposing the interfaces to users. We need to
> finalize the design and discuss which option is the best to go with.
>
> You can see that such discussions can be done in parallel. It is not
> efficient if we block the work on K8s because we cannot decide whether we
> should support Mesos.
>
>> On Mon, Mar 4, 2019 at 9:07 AM Xiangrui Meng <men...@gmail.com> wrote:
>> >
>> > What finer "high level" goals do you recommend? To make progress on
>> > the vote, it would be great if you could articulate more. The current
>> > SPIP proposes two high-level changes to make Spark accelerator-aware:
>> >
>> > At the cluster manager level, we update or upgrade cluster managers to
>> > include GPU support. Then we expose user interfaces for Spark to
>> > request GPUs from them.
>> > Within Spark, we update its scheduler to understand the GPUs allocated
>> > to executors and the GPU requests from user tasks, and to assign GPUs
>> > to tasks properly.
>> >
>> > How do you want to change or refine them? I saw you raised questions
>> > around Horovod requirements and GPU/memory allocation. But there are
>> > tens of questions at the same or even higher level. E.g., in preparing
>> > the companion scoping doc we saw the following questions:
>> >
>> > * How to test GPU support on Jenkins?
>> > * Does the proposed solution also work for FPGAs? What are the diffs?
>> > * How to make standalone workers auto-discover GPU resources?
>> > * Do we want to allow users to request GPU resources in Pandas UDFs?
>> > * How does a user pass GPU requests to K8s: via the spark-submit
>> > command line or a pod template?
>> > * Do we create a separate queue for GPU task scheduling so it doesn't
>> > cause regressions on normal jobs?
>> > * How to monitor GPU utilization? At what levels?
>> > * Do we want to support GPU-backed physical operators?
>> > * Do we allow users to request both non-default numbers of CPUs and
>> > GPUs?
>> > * ...
>> >
>> > IMHO, we cannot, nor should we, answer questions at this level in this
>> > vote. The vote is mainly about whether we should make Spark
>> > accelerator-aware to help unify big data and AI solutions, specifically
>> > whether Spark should provide proper support for deep learning model
>> > training and inference where accelerators are essential. My +1 vote is
>> > based on the following logic:
>> >
>> > * It is important for Spark to become the de facto solution for
>> > connecting big data and AI.
>> > * The work is doable given the design sketch and the early
>> > investigation/scoping.
>> >
>> > To me, "-1" means either it is not important for Spark to support such
>> > use cases or we certainly cannot afford to implement such support. This
>> > is my understanding of the SPIP and the vote. It would be great if you
>> > could elaborate on what changes you want to make or what answers you
>> > want to see.
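To make the two high-level changes quoted above more concrete, here is a purely hypothetical Scala sketch of what accelerator-aware scheduling could look like from a user's point of view. The configuration keys (spark.executor.resource.gpu.amount, spark.task.resource.gpu.amount) and the TaskContext.resources() accessor are illustrative assumptions only; as the thread itself notes, the actual APIs for each component are still to be reviewed and designed in follow-up work.

// Hypothetical sketch only: the SPIP deliberately leaves the concrete APIs
// to follow-up design work, so the configuration keys and the TaskContext
// accessor below are illustrative assumptions, not part of the proposal.
import org.apache.spark.TaskContext
import org.apache.spark.sql.SparkSession

object AcceleratorAwareSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("accelerator-aware-sketch")
      // Assumed config: ask the cluster manager for GPUs per executor ...
      .config("spark.executor.resource.gpu.amount", "4")
      // ... and tell the scheduler how many GPUs each task needs.
      .config("spark.task.resource.gpu.amount", "1")
      .getOrCreate()

    // Requires a cluster manager that can actually allocate GPUs; shown
    // here only to illustrate the task-level view the scheduler would give.
    val report = spark.sparkContext.parallelize(1 to 8, numSlices = 8).map { i =>
      // Assumed accessor: the scheduler tells each task which GPU addresses
      // were assigned to it, so user code (e.g., a deep learning framework)
      // can pin its work to those devices.
      val gpus = TaskContext.get().resources().get("gpu")
        .map(_.addresses.mkString(","))
        .getOrElse("none")
      s"element $i processed with GPU(s): $gpus"
    }

    report.collect().foreach(println)
    spark.stop()
  }
}

How a standalone worker would discover and report the GPU addresses handed out above is exactly one of the open questions listed in the quoted email.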