+1 on the updated SPIP

Xingbo Jiang <jiangxb1...@gmail.com> wrote on Tue, Mar 26, 2019 at 1:32 PM:
Hi all,

Now we have had a few discussions over the updated SPIP, and we also updated the SPIP to address new feedback from some committers. IMO the SPIP is ready for another round of voting now. On the updated SPIP, we currently have two +1s (from Tom and Xiangrui); everyone else, please vote again.

The vote will be up for the next 72 hours.

Thanks!

Xingbo

Xiangrui Meng <men...@gmail.com> wrote on Tue, Mar 26, 2019 at 11:32 AM:

On Mon, Mar 25, 2019 at 8:07 PM Mark Hamstra <m...@clearstorydata.com> wrote:

Maybe.

And I expect that we will end up doing something based on spark.task.cpus in the short term. I'd just rather that this SPIP not make it look like this is the way things should ideally be done. I'd prefer that we be quite explicit in recognizing that this approach is a significant compromise, and I'd like to see at least some references to the beginning of serious longer-term efforts to do something better in a deeper redesign of resource scheduling.

It is also a feature I desire as a user. How about suggesting it as future work in the SPIP? It certainly requires someone who fully understands the Spark scheduler to drive it. Shall we start with a Spark JIRA? I don't know the scheduler as well as you do, but I can speak for DL use cases. Maybe we just view it from different angles. To you, the application-level request is a significant compromise. To me, it provides a major milestone that brings GPUs to Spark workloads. I know many users who tried to do DL on Spark and ended up doing hacks here and there, which is a huge pain. The scope covered by the current SPIP makes those users much happier. Tom and Andy from NVIDIA are certainly better calibrated on the usefulness of the current proposal.

On Mon, Mar 25, 2019 at 7:39 PM Xiangrui Meng <m...@databricks.com> wrote:

There are certainly use cases where different stages require different numbers of CPUs or GPUs under an optimal setting. I don't think anyone disagrees that ideally users should be able to do this. We are just dealing with typical engineering trade-offs and seeing how to break the problem down into smaller pieces. I think it is fair to treat the task-level resource request as a separate feature here because it also applies to CPUs alone, without GPUs, as Tom mentioned above. And with only "spark.task.cpus" for many years, Spark has still been able to cover many, many use cases; otherwise we wouldn't see so many Spark users around now. Here we just apply similar arguments to GPUs.

Initially, I was the person who really wanted task-level requests because that is the ideal. In an offline discussion, Andy Feng pointed out that an application-level setting should fit common deep learning training and inference cases, and it greatly simplifies the changes required to the Spark job scheduler. With Imran's feedback on the initial design sketch, the application-level approach became my first choice because it is still very valuable but much less risky. If a feature brings great value to users, we should add it even if it is not ideal.

Back to the default value discussion, let's forget GPUs and only consider CPUs. Would an application-level default number of CPU cores disappear if we added task-level requests? If yes, does it mean that users have to explicitly state the resource requirements for every single stage? That is tedious to do, and users who do not fully understand the impact would probably do it wrong and waste even more resources. Then how many cores should each task use if the user didn't specify it? I do see "spark.task.cpus" as the answer here. The point I want to make is that "spark.task.cpus", though less than ideal, is still needed even when we have task-level requests for CPUs.
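For concreteness, a minimal sketch of the application-level defaults being discussed, assuming the config names proposed in the SPIP: "spark.executor.cores" and "spark.task.cpus" already exist today, while the GPU key below is only the SPIP's proposed name and may change before anything ships.

    import org.apache.spark.SparkConf

    // Application-level defaults: every executor and every task in the app
    // gets the same resource shape.
    val conf = new SparkConf()
      .set("spark.executor.cores", "4")             // CPUs per executor (existing config)
      .set("spark.task.cpus", "1")                  // default CPUs per task (existing config)
      .set("spark.task.accelerator.gpu.count", "1") // default GPUs per task (name proposed in the SPIP)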
On Mon, Mar 25, 2019 at 6:46 PM Mark Hamstra <m...@clearstorydata.com> wrote:

I remain unconvinced that a default configuration at the application level makes sense even in that case. There may be some applications where you know a priori that almost all the tasks for all the stages for all the jobs will need some fixed number of GPUs; but I think the more common cases will be dynamic configuration at the job or stage level. The stage level could have a lot of overlap with barrier mode scheduling -- barrier mode stages having a need for an inter-task channel resource, GPU-ified stages needing GPU resources, etc. Have I mentioned that I'm not a fan of the current barrier mode API, Xiangrui? :) Yes, I know: "Show me something better."

On Mon, Mar 25, 2019 at 3:55 PM Xiangrui Meng <men...@gmail.com> wrote:

Say we support per-task resource requests in the future; it would still be inconvenient for users to declare the resource requirements for every single task/stage. So there must be some default values defined somewhere for task resource requirements. "spark.task.cpus" and "spark.task.accelerator.gpu.count" could serve this purpose without introducing breaking changes. So I'm +1 on the updated SPIP. It fairly separates necessary GPU support from risky scheduler changes.

On Mon, Mar 25, 2019 at 8:39 AM Mark Hamstra <m...@clearstorydata.com> wrote:

Of course there is an issue of the perfect becoming the enemy of the good, so I can understand the impulse to get something done. I am left wanting, however, at least something more of a roadmap to a task-level future than just a vague "we may choose to do something more in the future." At the risk of repeating myself, I don't think the existing spark.task.cpus is very good, and I think that building more on that weak foundation, without a clearer path or stated intention to move to something better, runs the risk of leaving Spark stuck in a bad neighborhood.

On Thu, Mar 21, 2019 at 10:10 AM Tom Graves <tgraves...@yahoo.com> wrote:

While I agree with you that it would be ideal to have the task-level resources and do a deeper redesign of the scheduler, I think that can be a separate enhancement, as was discussed earlier in the thread. That feature is useful without GPUs. I do realize that they overlap some, but I think the changes for this will be minimal to the scheduler, will follow existing conventions, and it is an improvement over what we have now. I know many users will be happy to have this even without the task-level scheduling, as many of the conventions used now to schedule GPUs can easily be broken by one bad user. I think from the user point of view this gives many users an improvement, and we can extend it later to cover more use cases.

Tom

On Thursday, March 21, 2019, 9:15:05 AM PDT, Mark Hamstra <m...@clearstorydata.com> wrote:

I understand the application-level, static, global nature of spark.task.accelerator.gpu.count and its similarity to the existing spark.task.cpus, but to me this feels like extending a weakness of Spark's scheduler, not building on its strengths. That is because I consider binding the number of cores for each task to an application configuration to be far from optimal. This is already far from the desired behavior when an application is running a wide range of jobs (as in a generic job-runner style of Spark application), some of which require or can benefit from multi-core tasks, while others will just waste the extra cores allocated to their tasks. Ideally, the number of cores allocated to tasks would be pushed to an even finer granularity than jobs, and instead be a per-stage property.

Now, of course, making allocation of general-purpose cores and domain-specific resources work in this finer-grained fashion is a lot more work than just trying to extend the existing resource allocation mechanisms to handle domain-specific resources, but it does feel to me like we should at least be considering doing that deeper redesign.
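To make the per-stage idea concrete, a purely hypothetical sketch follows; the names are invented for illustration and are not an existing or proposed Spark API. The point is only that different stages of one application could carry different task resource requests.

    // Hypothetical only: invented names, not a Spark API at the time of this thread.
    case class TaskResourceRequest(cpus: Int, accelerators: Map[String, Int])

    // e.g. a plain ETL stage vs. a DL training stage within the same application:
    val etlStage   = TaskResourceRequest(cpus = 1, accelerators = Map.empty)
    val trainStage = TaskResourceRequest(cpus = 4, accelerators = Map("gpu" -> 1))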
On Thu, Mar 21, 2019 at 7:33 AM Tom Graves <tgraves...@yahoo.com.invalid> wrote:

The proposal here is that all your resources are static and the GPU-per-task config is global per application, meaning you ask for a certain amount of memory, CPU, and GPUs for every executor up front, just like you do today, and every executor you get is that size. This means that both static and dynamic allocation still work without explicitly adding more logic at this point. Since the config for GPUs per task is global, every task will need a certain ratio of CPU to GPU, so you can't really have the scenario you mentioned; all tasks are assumed to need a GPU. For instance, if I request 5 cores and 2 GPUs for each executor and set 1 GPU per task, then I could only run 2 tasks and 3 cores would be wasted. The stage/task-level configuration of resources was removed and is something we can do in a separate SPIP.

We thought erroring would make it more obvious to the user. We could change this to a warning if everyone thinks that is better, but I personally like the error until we can implement the lower-level, per-stage configuration.

Tom
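A small worked sketch of the arithmetic in Tom's example above, assuming each concurrent task consumes one slot of each resource it requests:

    // 5 cores and 2 GPUs per executor, 1 CPU and 1 GPU per task (global for the app).
    val executorCores = 5
    val executorGpus  = 2
    val cpusPerTask   = 1
    val gpusPerTask   = 1

    // The scarcer resource bounds concurrency:
    val concurrentTasks = math.min(executorCores / cpusPerTask, executorGpus / gpusPerTask) // = 2
    val idleCores       = executorCores - concurrentTasks * cpusPerTask                     // = 3 wasted cores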
On Thursday, March 21, 2019, 1:45:01 AM PDT, Marco Gaido <marcogaid...@gmail.com> wrote:

Thanks for this SPIP. I cannot comment on the docs, but just wanted to highlight one thing. On page 5 of the SPIP, where we talk about DRA, I see:

"For instance, if each executor consists of 4 CPUs and 2 GPUs, and each task requires 1 CPU and 1 GPU, then we shall throw an error on application start because we shall always have at least 2 idle CPUs per executor"

I am not sure this is the correct behavior. We might have tasks requiring only CPU running in parallel as well, hence that configuration may make sense. I'd rather emit a WARN or something similar. Anyway, we just said we will keep GPU scheduling at the task level out of scope for the moment, right?

Thanks,
Marco
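A sketch of the start-up check being debated here, illustrative only and not code from the SPIP or from Spark:

    // The SPIP's example: 4 CPUs and 2 GPUs per executor, each task needs 1 CPU and 1 GPU.
    val executorCores = 4
    val executorGpus  = 2
    val cpusPerTask   = 1
    val gpusPerTask   = 1

    val cpuSlots = executorCores / cpusPerTask // 4
    val gpuSlots = executorGpus / gpusPerTask  // 2
    if (cpuSlots > gpuSlots) {
      // 2 CPUs per executor can never be used while every task needs a GPU.
      // The SPIP proposes failing fast here; Marco suggests a WARN instead,
      // since CPU-only tasks could still use those cores.
      throw new IllegalArgumentException(
        s"Only $gpuSlots concurrent tasks per executor; ${cpuSlots - gpuSlots} CPU slots always idle")
    }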
On Thu, Mar 21, 2019 at 01:26, Xiangrui Meng <m...@databricks.com> wrote:

Steve, the initial work would focus on GPUs, but we will keep the interfaces general to support other accelerators in the future. This was mentioned in the SPIP and draft design.

Imran, you should have comment permission now. Thanks for making a pass! I don't think the proposed 3.0 features should block the Spark 3.0 release either. It is just an estimate of what we could deliver. I will update the doc to make that clear.

Felix, it would be great if you could review the updated docs and let us know your feedback.

How about setting a tentative vote closing time of next Tue (Mar 26)?

On Wed, Mar 20, 2019 at 11:01 AM Imran Rashid <im...@therashids.com> wrote:

Thanks for sending the updated docs. Can you please give everyone the ability to comment? I have some comments, but overall I think this is a good proposal and addresses my prior concerns.

My only real concern is that I notice some mention of "must dos" for Spark 3.0. I don't want to make any commitment to holding Spark 3.0 for parts of this; I think that is an entirely separate decision. However, I'm guessing this is just a minor wording issue, and you really mean that's the minimal set of features you are aiming for, which is reasonable.

On Mon, Mar 18, 2019 at 12:56 PM Xingbo Jiang <jiangxb1...@gmail.com> wrote:

Hi all,

I updated the SPIP doc <https://docs.google.com/document/d/1C4J_BPOcSCJc58HL7JfHtIzHrjU0rLRdQM3y7ejil64/edit#> and stories <https://docs.google.com/document/d/12JjloksHCdslMXhdVZ3xY5l1Nde3HRhIrqvzGnK_bNE/edit#heading=h.udyua28eu3sg>; I hope they now contain a clear scope of the changes and enough details for the SPIP vote.
Please review the updated docs, thanks!

Xiangrui Meng <men...@gmail.com> wrote on Wed, Mar 6, 2019 at 8:35 AM:

How about letting Xingbo make a major revision to the SPIP doc to make it clear what is being proposed? I like Felix's suggestion to switch to the new Heilmeier template, which helps clarify what is proposed and what is not. Then let's review the new SPIP and resume the vote.

On Tue, Mar 5, 2019 at 7:54 AM Imran Rashid <im...@therashids.com> wrote:

OK, I suppose then we are getting bogged down in what a vote on a SPIP means anyway, which I guess we can set aside for now. With the level of detail in this proposal, I feel like there is a reasonable chance I'd still -1 the design or implementation.

And the other thing you're implicitly asking the community for is to prioritize this feature for continued review and maintenance. There is already work to be done on things like making barrier mode support dynamic allocation (SPARK-24942), bugs in failure handling (e.g. SPARK-25250), and general efficiency of failure handling (e.g. SPARK-25341, SPARK-20178). I'm very concerned about getting spread too thin.

But if this is really just a vote on (1) is better GPU support important for Spark, in some form, in some release? and (2) is it *possible* to do this in a safe way? then I will vote +0.

On Tue, Mar 5, 2019 at 8:25 AM Tom Graves <tgraves...@yahoo.com> wrote:

So to me most of the questions here are implementation/design questions. I've had this issue in the past with SPIPs, where I expected to have more high-level design details but was basically told that belongs in the design JIRA follow-on. This makes me think we need to revisit what a SPIP really needs to contain, which should be done in a separate thread. Note, personally I would be for having more high-level details in it.

But the way I read our documentation on a SPIP right now, that detail is all optional. Maybe we could argue it's based on what reviewers request, but perhaps we should make the wording of that more required. Thoughts? We should probably separate that discussion if people want to talk about it.

For this SPIP in particular, the reason I +1'd it is because it came down to 2 questions:

1) Do I think Spark should support this? My answer is yes. I think this would improve Spark; users have been requesting both better GPU support and support for controlling container requests at a finer granularity for a while. If Spark doesn't support this then users may go to something else, so I think we should support it.

2) Do I think it's possible to design and implement it without causing large instabilities? My opinion here again is yes. I agree with Imran and others that the scheduler piece needs to be looked at very closely, as we have had a lot of issues there, and that is why I was asking for more details in the design JIRA: https://issues.apache.org/jira/browse/SPARK-27005. But I do believe it's possible to do.

If others have reservations on similar questions, then I think we should resolve them here or take the discussion of what a SPIP is to a different thread and then come back to this. Thoughts?

Note there is already a high-level design for at least the core piece, which is what people seem concerned with, so including it in the SPIP should be straightforward.
Tom

On Monday, March 4, 2019, 2:52:43 PM CST, Imran Rashid <im...@therashids.com> wrote:

On Sun, Mar 3, 2019 at 6:51 PM Xiangrui Meng <men...@gmail.com> wrote:

On Sun, Mar 3, 2019 at 10:20 AM Felix Cheung <felixcheun...@hotmail.com> wrote:

IMO upfront allocation is less useful. Specifically, it is too expensive for large jobs.

This is also an API/design discussion.

I agree with Felix -- this is more than just an API question. It has a huge impact on the complexity of what you're proposing. You might be proposing big changes to a core and brittle part of Spark, which is already short of experts.

I don't see any value in having a vote on "does feature X sound cool?" We have to evaluate the potential benefit against the risks the feature brings and the continued maintenance cost. We don't need super low-level details, but we have to have a sketch of the design to be able to make that tradeoff.