+1 on the updated SPIP

Xingbo Jiang <jiangxb1...@gmail.com> wrote on Tue, Mar 26, 2019 at 1:32 PM:
Hi all,

Now we have had a few discussions over the updated SPIP, and we also updated the SPIP to address new feedback from some committers. IMO the SPIP is ready for another round of voting now. On the updated SPIP, we currently have two +1s (from Tom and Xiangrui); everyone else, please vote again.

The vote will be up for the next 72 hours.

Thanks!

Xingbo

Xiangrui Meng <men...@gmail.com> wrote on Tue, Mar 26, 2019 at 11:32 AM:

On Mon, Mar 25, 2019 at 8:07 PM Mark Hamstra <m...@clearstorydata.com> wrote:

Maybe.

And I expect that we will end up doing something based on spark.task.cpus in the short term. I'd just rather that this SPIP not make it look like this is the way things should ideally be done. I'd prefer that we be quite explicit in recognizing that this approach is a significant compromise, and I'd like to see at least some references to the beginning of serious longer-term efforts to do something better in a deeper redesign of resource scheduling.

It is also a feature I desire as a user. How about suggesting it as future work in the SPIP? It certainly requires someone who fully understands the Spark scheduler to drive it. Shall we start with a Spark JIRA? I don't know the scheduler as well as you do, but I can speak for DL use cases. Maybe we just view it from different angles. To you, the application-level request is a significant compromise. To me, it provides a major milestone that brings GPUs to Spark workloads. I know many users who tried to do DL on Spark and ended up doing hacks here and there, which is a huge pain. The scope covered by the current SPIP makes those users much happier. Tom and Andy from NVIDIA are certainly better calibrated on the usefulness of the current proposal.

On Mon, Mar 25, 2019 at 7:39 PM Xiangrui Meng <m...@databricks.com> wrote:

There are certainly use cases where different stages require different numbers of CPUs or GPUs under an optimal setting. I don't think anyone disagrees that ideally users should be able to do this. We are just dealing with typical engineering trade-offs and seeing how to break the problem down into smaller pieces. I think it is fair to treat the task-level resource request as a separate feature here because it also applies to CPUs alone, without GPUs, as Tom mentioned above. And with only "spark.task.cpus" for many years, Spark has still been able to cover many, many use cases; otherwise we wouldn't see so many Spark users around now. Here we just apply similar arguments to GPUs.

Initially, I was the person who really wanted task-level requests because that is the ideal. In an offline discussion, Andy Feng pointed out that an application-level setting should fit common deep learning training and inference cases, and it greatly simplifies the changes required to the Spark job scheduler. With Imran's feedback on the initial design sketch, the application-level approach became my first choice because it is still very valuable but much less risky. If a feature brings great value to users, we should add it even if it is not ideal.

Back to the default value discussion, let's forget GPUs and only consider CPUs. Would an application-level default number of CPU cores disappear if we added task-level requests? If yes, does it mean that users have to explicitly state the resource requirements for every single stage? That is tedious to do, and users who do not fully understand the impact would probably do it wrong and waste even more resources. Then how many cores should each task use if the user didn't specify it? I do see "spark.task.cpus" as the answer here. The point I want to make is that "spark.task.cpus", though less than ideal, is still needed even when we have task-level requests for CPUs.
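For concreteness, a minimal sketch of the application-level defaults being discussed, assuming the config names proposed in the SPIP: "spark.executor.cores" and "spark.task.cpus" already exist today, while the GPU key below is only the SPIP's proposed name and may change before anything ships.

    import org.apache.spark.SparkConf

    // Application-level defaults: every executor and every task in the app
    // gets the same resource shape.
    val conf = new SparkConf()
      .set("spark.executor.cores", "4")             // CPUs per executor (existing config)
      .set("spark.task.cpus", "1")                  // default CPUs per task (existing config)
      .set("spark.task.accelerator.gpu.count", "1") // default GPUs per task (name proposed in the SPIP)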
On Mon, Mar 25, 2019 at 6:46 PM Mark Hamstra <m...@clearstorydata.com> wrote:

I remain unconvinced that a default configuration at the application level makes sense even in that case. There may be some applications where you know a priori that almost all the tasks for all the stages for all the jobs will need some fixed number of GPUs; but I think the more common cases will be dynamic configuration at the job or stage level. The stage level could have a lot of overlap with barrier mode scheduling -- barrier mode stages having a need for an inter-task channel resource, GPU-ified stages needing GPU resources, etc. Have I mentioned that I'm not a fan of the current barrier mode API, Xiangrui? :) Yes, I know: "Show me something better."

On Mon, Mar 25, 2019 at 3:55 PM Xiangrui Meng <men...@gmail.com> wrote:

Say we support per-task resource requests in the future; it would still be inconvenient for users to declare the resource requirements for every single task/stage. So there must be some default values defined somewhere for task resource requirements. "spark.task.cpus" and "spark.task.accelerator.gpu.count" could serve this purpose without introducing breaking changes. So I'm +1 on the updated SPIP. It fairly separates necessary GPU support from risky scheduler changes.

On Mon, Mar 25, 2019 at 8:39 AM Mark Hamstra <m...@clearstorydata.com> wrote:

Of course there is an issue of the perfect becoming the enemy of the good, so I can understand the impulse to get something done. I am left wanting, however, at least something more of a roadmap to a task-level future than just a vague "we may choose to do something more in the future." At the risk of repeating myself, I don't think the existing spark.task.cpus is very good, and I think that building more on that weak foundation, without a clearer path or stated intention to move to something better, runs the risk of leaving Spark stuck in a bad neighborhood.

On Thu, Mar 21, 2019 at 10:10 AM Tom Graves <tgraves...@yahoo.com> wrote:

While I agree with you that it would be ideal to have the task-level resources and do a deeper redesign of the scheduler, I think that can be a separate enhancement, as was discussed earlier in the thread. That feature is useful without GPUs. I do realize that they overlap some, but I think the changes for this will be minimal to the scheduler, will follow existing conventions, and it is an improvement over what we have now. I know many users will be happy to have this even without the task-level scheduling, as many of the conventions used now to schedule GPUs can easily be broken by one bad user. I think from the user point of view this gives many users an improvement, and we can extend it later to cover more use cases.

Tom

On Thursday, March 21, 2019, 9:15:05 AM PDT, Mark Hamstra <m...@clearstorydata.com> wrote:

I understand the application-level, static, global nature of spark.task.accelerator.gpu.count and its similarity to the existing spark.task.cpus, but to me this feels like extending a weakness of Spark's scheduler, not building on its strengths. That is because I consider binding the number of cores for each task to an application configuration to be far from optimal. This is already far from the desired behavior when an application is running a wide range of jobs (as in a generic job-runner style of Spark application), some of which require or can benefit from multi-core tasks, while others will just waste the extra cores allocated to their tasks. Ideally, the number of cores allocated to tasks would be pushed to an even finer granularity than jobs, and instead be a per-stage property.

Now, of course, making allocation of general-purpose cores and domain-specific resources work in this finer-grained fashion is a lot more work than just trying to extend the existing resource allocation mechanisms to handle domain-specific resources, but it does feel to me like we should at least be considering doing that deeper redesign.
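To make the per-stage idea concrete, a purely hypothetical sketch follows; the names are invented for illustration and are not an existing or proposed Spark API. The point is only that different stages of one application could carry different task resource requests.

    // Hypothetical only: invented names, not a Spark API at the time of this thread.
    case class TaskResourceRequest(cpus: Int, accelerators: Map[String, Int])

    // e.g. a plain ETL stage vs. a DL training stage within the same application:
    val etlStage   = TaskResourceRequest(cpus = 1, accelerators = Map.empty)
    val trainStage = TaskResourceRequest(cpus = 4, accelerators = Map("gpu" -> 1))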
On Thu, Mar 21, 2019 at 7:33 AM Tom Graves <tgraves...@yahoo.com.invalid> wrote:

The proposal here is that all your resources are static and the GPU-per-task config is global per application, meaning you ask for a certain amount of memory, CPU, and GPUs for every executor up front, just like you do today, and every executor you get is that size. This means that both static and dynamic allocation still work without explicitly adding more logic at this point. Since the config for GPUs per task is global, every task will need a certain ratio of CPU to GPU, so you can't really have the scenario you mentioned; all tasks are assumed to need a GPU. For instance, if I request 5 cores and 2 GPUs for each executor and set 1 GPU per task, then I could only run 2 tasks and 3 cores would be wasted. The stage/task-level configuration of resources was removed and is something we can do in a separate SPIP.

We thought erroring would make it more obvious to the user. We could change this to a warning if everyone thinks that is better, but I personally like the error until we can implement the lower-level, per-stage configuration.

Tom
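A small worked sketch of the arithmetic in Tom's example above, assuming each concurrent task consumes one slot of each resource it requests:

    // 5 cores and 2 GPUs per executor, 1 CPU and 1 GPU per task (global for the app).
    val executorCores = 5
    val executorGpus  = 2
    val cpusPerTask   = 1
    val gpusPerTask   = 1

    // The scarcer resource bounds concurrency:
    val concurrentTasks = math.min(executorCores / cpusPerTask, executorGpus / gpusPerTask) // = 2
    val idleCores       = executorCores - concurrentTasks * cpusPerTask                     // = 3 wasted cores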
On Thursday, March 21, 2019, 1:45:01 AM PDT, Marco Gaido <marcogaid...@gmail.com> wrote:

Thanks for this SPIP. I cannot comment on the docs, but just wanted to highlight one thing. On page 5 of the SPIP, where we talk about DRA, I see:

"For instance, if each executor consists of 4 CPUs and 2 GPUs, and each task requires 1 CPU and 1 GPU, then we shall throw an error on application start because we shall always have at least 2 idle CPUs per executor"

I am not sure this is the correct behavior. We might have tasks requiring only CPU running in parallel as well, hence that configuration may make sense. I'd rather emit a WARN or something similar. Anyway, we just said we will keep GPU scheduling at the task level out of scope for the moment, right?

Thanks,
Marco
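A sketch of the start-up check being debated here, illustrative only and not code from the SPIP or from Spark:

    // The SPIP's example: 4 CPUs and 2 GPUs per executor, each task needs 1 CPU and 1 GPU.
    val executorCores = 4
    val executorGpus  = 2
    val cpusPerTask   = 1
    val gpusPerTask   = 1

    val cpuSlots = executorCores / cpusPerTask // 4
    val gpuSlots = executorGpus / gpusPerTask  // 2
    if (cpuSlots > gpuSlots) {
      // 2 CPUs per executor can never be used while every task needs a GPU.
      // The SPIP proposes failing fast here; Marco suggests a WARN instead,
      // since CPU-only tasks could still use those cores.
      throw new IllegalArgumentException(
        s"Only $gpuSlots concurrent tasks per executor; ${cpuSlots - gpuSlots} CPU slots always idle")
    }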
On Thu, Mar 21, 2019 at 01:26, Xiangrui Meng <m...@databricks.com> wrote:

Steve, the initial work would focus on GPUs, but we will keep the interfaces general to support other accelerators in the future. This was mentioned in the SPIP and draft design.

Imran, you should have comment permission now. Thanks for making a pass! I don't think the proposed 3.0 features should block the Spark 3.0 release either. It is just an estimate of what we could deliver. I will update the doc to make that clear.

Felix, it would be great if you could review the updated docs and let us know your feedback.

How about setting a tentative vote closing time of next Tue (Mar 26)?

On Wed, Mar 20, 2019 at 11:01 AM Imran Rashid <im...@therashids.com> wrote:

Thanks for sending the updated docs. Can you please give everyone the ability to comment? I have some comments, but overall I think this is a good proposal and addresses my prior concerns.

My only real concern is that I notice some mention of "must dos" for Spark 3.0. I don't want to make any commitment to holding Spark 3.0 for parts of this; I think that is an entirely separate decision. However, I'm guessing this is just a minor wording issue, and you really mean that's the minimal set of features you are aiming for, which is reasonable.

On Mon, Mar 18, 2019 at 12:56 PM Xingbo Jiang <jiangxb1...@gmail.com> wrote:

Hi all,

I updated the SPIP doc <https://docs.google.com/document/d/1C4J_BPOcSCJc58HL7JfHtIzHrjU0rLRdQM3y7ejil64/edit#> and stories <https://docs.google.com/document/d/12JjloksHCdslMXhdVZ3xY5l1Nde3HRhIrqvzGnK_bNE/edit#heading=h.udyua28eu3sg>; I hope they now contain a clear scope of the changes and enough details for the SPIP vote.
Please review the updated docs, thanks!

Xiangrui Meng <men...@gmail.com> wrote on Wed, Mar 6, 2019 at 8:35 AM:

How about letting Xingbo make a major revision to the SPIP doc to make it clear what is being proposed? I like Felix's suggestion to switch to the new Heilmeier template, which helps clarify what is proposed and what is not. Then let's review the new SPIP and resume the vote.

On Tue, Mar 5, 2019 at 7:54 AM Imran Rashid <im...@therashids.com> wrote:

OK, I suppose then we are getting bogged down in what a vote on a SPIP means anyway, which I guess we can set aside for now. With the level of detail in this proposal, I feel like there is a reasonable chance I'd still -1 the design or implementation.

And the other thing you're implicitly asking the community for is to prioritize this feature for continued review and maintenance. There is already work to be done on things like making barrier mode support dynamic allocation (SPARK-24942), bugs in failure handling (e.g. SPARK-25250), and general efficiency of failure handling (e.g. SPARK-25341, SPARK-20178). I'm very concerned about getting spread too thin.

But if this is really just a vote on (1) is better GPU support important for Spark, in some form, in some release? and (2) is it *possible* to do this in a safe way? then I will vote +0.

On Tue, Mar 5, 2019 at 8:25 AM Tom Graves <tgraves...@yahoo.com> wrote:

So to me most of the questions here are implementation/design questions. I've had this issue in the past with SPIPs, where I expected to have more high-level design details but was basically told that belongs in the design JIRA follow-on. This makes me think we need to revisit what a SPIP really needs to contain, which should be done in a separate thread. Note, personally I would be for having more high-level details in it.

But the way I read our documentation on a SPIP right now, that detail is all optional. Maybe we could argue it's based on what reviewers request, but perhaps we should make the wording of that more required. Thoughts? We should probably separate that discussion if people want to talk about it.

For this SPIP in particular, the reason I +1'd it is because it came down to 2 questions:

1) Do I think Spark should support this? My answer is yes. I think this would improve Spark; users have been requesting both better GPU support and support for controlling container requests at a finer granularity for a while. If Spark doesn't support this then users may go to something else, so I think we should support it.

2) Do I think it's possible to design and implement it without causing large instabilities? My opinion here again is yes. I agree with Imran and others that the scheduler piece needs to be looked at very closely, as we have had a lot of issues there, and that is why I was asking for more details in the design JIRA: https://issues.apache.org/jira/browse/SPARK-27005. But I do believe it's possible to do.

If others have reservations on similar questions, then I think we should resolve them here or take the discussion of what a SPIP is to a different thread and then come back to this. Thoughts?

Note there is already a high-level design for at least the core piece, which is what people seem concerned with, so including it in the SPIP should be straightforward.
Tom

On Monday, March 4, 2019, 2:52:43 PM CST, Imran Rashid <im...@therashids.com> wrote:

On Sun, Mar 3, 2019 at 6:51 PM Xiangrui Meng <men...@gmail.com> wrote:

On Sun, Mar 3, 2019 at 10:20 AM Felix Cheung <felixcheun...@hotmail.com> wrote:

IMO upfront allocation is less useful. Specifically, it is too expensive for large jobs.

This is also an API/design discussion.

I agree with Felix -- this is more than just an API question. It has a huge impact on the complexity of what you're proposing. You might be proposing big changes to a core and brittle part of Spark, which is already short of experts.

I don't see any value in having a vote on "does feature X sound cool?" We have to evaluate the potential benefit against the risks the feature brings and the continued maintenance cost. We don't need super low-level details, but we have to have a sketch of the design to be able to make that tradeoff.