There are certainly use cases where different stages require different
numbers of CPUs or GPUs under an optimal setting. I don't think anyone
disagrees that ideally users should be able to do this. We are just dealing
with typical engineering trade-offs and deciding how to break the work down
into smaller pieces. I think it is fair to treat the task-level resource
request as a separate feature here because it also applies to CPUs alone
without GPUs, as Tom mentioned above. And even with only "spark.task.cpus"
for many years, Spark has still been able to cover many, many use cases.
Otherwise we wouldn't see so many Spark users around now. Here we just
apply similar arguments to GPUs.

Initially, I was the person who really wanted task-level requests because
they are ideal. In an offline discussion, Andy Feng pointed out that an
application-level setting should fit common deep learning training and
inference cases while greatly simplifying the changes required to the Spark
job scheduler. With Imran's feedback on the initial design sketch, the
application-level approach became my first choice because it is still very
valuable but much less risky. If a feature brings great value to users, we
should add it even if it is not ideal.

Back to the default value discussion, let's forget GPUs and only consider
CPUs. Would an application-level default number of CPU cores disappear if
we added task-level requests? If yes, does it mean that users have to
explicitly state the resource requirements for every single stage? That is
tedious to do, and users who do not fully understand the impact would
probably get it wrong and waste even more resources. Then how many cores
should each task use if the user didn't specify? I do see "spark.task.cpus"
as the answer here. The point I want to make is that "spark.task.cpus",
though less ideal, is still needed even once we have task-level requests
for CPUs.
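
To make that fallback concrete, here is a minimal sketch of the resolution
logic I have in mind. The per-stage request parameter is purely
hypothetical; only "spark.task.cpus" (default 1) is a real config today:

    import org.apache.spark.SparkConf

    // Hypothetical scheduler-side resolution (names made up):
    // a stage-level request wins if present; otherwise fall back to
    // the application-level default "spark.task.cpus".
    def cpusPerTask(stageRequest: Option[Int], conf: SparkConf): Int =
      stageRequest.getOrElse(conf.getInt("spark.task.cpus", 1))

So even with task-level requests, "spark.task.cpus" survives as the value
that stages inherit when they don't declare anything themselves.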

On Mon, Mar 25, 2019 at 6:46 PM Mark Hamstra <m...@clearstorydata.com>
wrote:

> I remain unconvinced that a default configuration at the application level
> makes sense even in that case. There may be some applications where you
> know a priori that almost all the tasks for all the stages for all the jobs
> will need some fixed number of gpus; but I think the more common cases will
> be dynamic configuration at the job or stage level. Stage level could have
> a lot of overlap with barrier mode scheduling -- barrier mode stages having
> a need for an inter-task channel resource, gpu-ified stages needing gpu
> resources, etc. Have I mentioned that I'm not a fan of the current barrier
> mode API, Xiangrui? :) Yes, I know: "Show me something better."
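>
> To make "stage level" concrete, something like this purely hypothetical
> shape is what I have in mind -- every name below is made up for
> illustration, not a proposal for the actual API:
>
>     // Hypothetical per-stage resource declaration (illustrative only):
>     case class StageResources(cpus: Int = 1, gpus: Int = 0)
>
>     // Roughly what I'd want to write:
>     //   data.withResources(StageResources(cpus = 4, gpus = 1))
>     //       .mapPartitions(trainOnGpu)   // gpu-ified training stage
>     //   out.withResources(StageResources(cpus = 1))
>     //       .mapPartitions(score)        // plain CPU scoring stage
>
> A barrier-mode stage could declare its inter-task channel resource the
> same way.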
>
> On Mon, Mar 25, 2019 at 3:55 PM Xiangrui Meng <men...@gmail.com> wrote:
>
>> Say we support per-task resource requests in the future; it would still
>> be inconvenient for users to declare the resource requirements for every
>> single task/stage. So there must be some default values defined somewhere
>> for task resource requirements. "spark.task.cpus" and
>> "spark.task.accelerator.gpu.count" could serve this purpose without
>> introducing breaking changes. So I'm +1 on the updated SPIP. It cleanly
>> separates necessary GPU support from risky scheduler changes.
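>>
>> As a concrete sketch (taking "spark.task.accelerator.gpu.count" as the
>> name proposed in the SPIP; it may change during review), the
>> application-level defaults would just be:
>>
>>     import org.apache.spark.SparkConf
>>
>>     val conf = new SparkConf()
>>       .set("spark.task.cpus", "1")                  // existing config
>>       .set("spark.task.accelerator.gpu.count", "1") // SPIP-proposed name
>>
>> A future per-task/per-stage API would then only need to override these
>> defaults when a stage deviates from the global ratio.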
>>
>> On Mon, Mar 25, 2019 at 8:39 AM Mark Hamstra <m...@clearstorydata.com>
>> wrote:
>>
>>> Of course there is an issue of the perfect becoming the enemy of the
>>> good, so I can understand the impulse to get something done. I am left
>>> wanting, however, at least something more of a roadmap to a task-level
>>> future than just a vague "we may choose to do something more in the
>>> future." At the risk of repeating myself, I don't think the
>>> existing spark.task.cpus is very good, and I think that building more on
>>> that weak foundation without a more clear path or stated intention to move
>>> to something better runs the risk of leaving Spark stuck in a bad
>>> neighborhood.
>>>
>>> On Thu, Mar 21, 2019 at 10:10 AM Tom Graves <tgraves...@yahoo.com>
>>> wrote:
>>>
>>>> While I agree with you that it would be ideal to have the task-level
>>>> resources and do a deeper redesign of the scheduler, I think that can
>>>> be a separate enhancement, as was discussed earlier in the thread. That
>>>> feature is useful without GPUs. I do realize that they overlap some,
>>>> but I think the changes to the scheduler for this will be minimal,
>>>> follow existing conventions, and be an improvement over what we have
>>>> now. I know many users will be happy to have this even without the
>>>> task-level scheduling, as many of the conventions used now to schedule
>>>> GPUs can easily be broken by one bad user. I think from the user point
>>>> of view this gives many users an improvement, and we can extend it
>>>> later to cover more use cases.
>>>>
>>>> Tom
>>>> On Thursday, March 21, 2019, 9:15:05 AM PDT, Mark Hamstra <
>>>> m...@clearstorydata.com> wrote:
>>>>
>>>>
>>>> I understand the application-level, static, global nature of
>>>> spark.task.accelerator.gpu.count and its similarity to the existing
>>>> spark.task.cpus, but to me this feels like extending a weakness of
>>>> Spark's scheduler, not building on its strengths. That is because I
>>>> consider binding the number of cores for each task to an application
>>>> configuration to be far from optimal. It is already far from the
>>>> desired behavior when an application is running a wide range of jobs
>>>> (as in a generic job-runner style of Spark application), some of which
>>>> require or can benefit from multi-core tasks, while others will just
>>>> waste the extra cores allocated to their tasks. Ideally, the number of
>>>> cores allocated to tasks would be pushed to an even finer granularity
>>>> than jobs, instead becoming a per-stage property.
>>>>
>>>> Now, of course, making allocation of general-purpose cores and
>>>> domain-specific resources work in this finer-grained fashion is a lot more
>>>> work than just trying to extend the existing resource allocation mechanisms
>>>> to handle domain-specific resources, but it does feel to me like we should
>>>> at least be considering doing that deeper redesign.
>>>>
>>>> On Thu, Mar 21, 2019 at 7:33 AM Tom Graves <tgraves...@yahoo.com.invalid>
>>>> wrote:
>>>>
>>>> The proposal here is that all your resources are static and the
>>>> gpu-per-task config is global per application, meaning you ask for a
>>>> certain amount of memory, CPU, and GPUs for every executor up front,
>>>> just like you do today, and every executor you get is that size. This
>>>> means that both static and dynamic allocation still work without
>>>> explicitly adding more logic at this point. Since the config for GPUs
>>>> per task is global, every task will need a certain ratio of CPU to GPU,
>>>> so you can't really have the scenario you mentioned: all tasks are
>>>> assumed to need a GPU. For instance, I request 5 cores and 2 GPUs for
>>>> each executor, and set 1 GPU per task. That means I could only run 2
>>>> tasks, and 3 cores would be wasted. The stage/task-level configuration
>>>> of resources was removed and is something we can do in a separate SPIP.
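>>>>
>>>> To spell out the arithmetic in that example (the numbers come straight
>>>> from it; the variable names are just illustrative):
>>>>
>>>>     // Per executor: 5 cores, 2 GPUs. Per task: 1 core, 1 GPU.
>>>>     val slotsByCpu = 5 / 1                            // 5 tasks fit by CPU
>>>>     val slotsByGpu = 2 / 1                            // 2 tasks fit by GPU
>>>>     val concurrent = math.min(slotsByCpu, slotsByGpu) // = 2 tasks
>>>>     val wastedCores = 5 - concurrent * 1              // = 3 idle cores
>>>>
>>>> The GPU count is the binding constraint, so 3 of the 5 cores sit idle.
>>>>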
>>>> We thought erroring would make it more obvious to the user. We could
>>>> change this to a warning if everyone thinks that is better, but I
>>>> personally like the error until we can implement the lower-level
>>>> per-stage configuration.
>>>>
>>>> Tom
>>>>
>>>> On Thursday, March 21, 2019, 1:45:01 AM PDT, Marco Gaido <
>>>> marcogaid...@gmail.com> wrote:
>>>>
>>>>
>>>> Thanks for this SPIP.
>>>> I cannot comment on the docs, but just wanted to highlight one thing.
>>>> In page 5 of the SPIP, when we talk about DRA, I see:
>>>>
>>>> "For instance, if each executor consists 4 CPUs and 2 GPUs, and each
>>>> task requires 1 CPU and 1GPU, then we shall throw an error on application
>>>> start because we shall always have at least 2 idle CPUs per executor"
>>>>
>>>> I am not sure this is the correct behavior. We might have tasks
>>>> requiring only CPUs running in parallel as well, hence that setup may
>>>> make sense. I'd rather emit a WARN or something similar. Anyway, we
>>>> just said we would keep task-level GPU scheduling out of scope for the
>>>> moment, right?
>>>>
>>>> Thanks,
>>>> Marco
>>>>
>>>> On Thu, Mar 21, 2019 at 1:26 AM Xiangrui Meng <
>>>> m...@databricks.com> wrote:
>>>>
>>>> Steve, the initial work would focus on GPUs, but we will keep the
>>>> interfaces general to support other accelerators in the future. This was
>>>> mentioned in the SPIP and draft design.
>>>>
>>>> Imran, you should have comment permission now. Thanks for making a
>>>> pass! I don't think the proposed 3.0 features should block Spark 3.0
>>>> release either. It is just an estimate of what we could deliver. I will
>>>> update the doc to make it clear.
>>>>
>>>> Felix, it would be great if you can review the updated docs and let us
>>>> know your feedback.
>>>>
>>>> How about setting a tentative vote closing time to next Tue (Mar 26)?
>>>>
>>>> On Wed, Mar 20, 2019 at 11:01 AM Imran Rashid <im...@therashids.com>
>>>> wrote:
>>>>
>>>> Thanks for sending the updated docs.  Can you please give everyone the
>>>> ability to comment?  I have some comments, but overall I think this is a
>>>> good proposal and addresses my prior concerns.
>>>>
>>>> My only real concern is that I notice some mention of "must dos" for
>>>> spark 3.0.  I don't want to make any commitment to holding spark 3.0 for
>>>> parts of this, I think that is an entirely separate decision.  However I'm
>>>> guessing this is just a minor wording issue, and you really mean that's a
>>>> minimal set of features you are aiming for, which is reasonable.
>>>>
>>>> On Mon, Mar 18, 2019 at 12:56 PM Xingbo Jiang <jiangxb1...@gmail.com>
>>>> wrote:
>>>>
>>>> Hi all,
>>>>
>>>> I updated the SPIP doc
>>>> <https://docs.google.com/document/d/1C4J_BPOcSCJc58HL7JfHtIzHrjU0rLRdQM3y7ejil64/edit#>
>>>> and stories
>>>> <https://docs.google.com/document/d/12JjloksHCdslMXhdVZ3xY5l1Nde3HRhIrqvzGnK_bNE/edit#heading=h.udyua28eu3sg>.
>>>> I hope they now contain a clear scope of the changes and enough details
>>>> for the SPIP vote.
>>>> Please review the updated docs, thanks!
>>>>
>>>> On Wed, Mar 6, 2019 at 8:35 AM Xiangrui Meng <men...@gmail.com> wrote:
>>>>
>>>> How about letting Xingbo make a major revision to the SPIP doc to
>>>> make it clear what is being proposed? I like Felix's suggestion to
>>>> switch to the new Heilmeier template, which helps clarify what is
>>>> proposed and what is not. Then let's review the new SPIP and resume
>>>> the vote.
>>>>
>>>> On Tue, Mar 5, 2019 at 7:54 AM Imran Rashid <im...@therashids.com>
>>>> wrote:
>>>>
>>>> OK, I suppose we are getting bogged down in what a vote on an SPIP
>>>> means anyway, which I guess we can set aside for now. With the level of
>>>> detail in this proposal, I feel like there is a reasonable chance I'd
>>>> still -1 the design or implementation.
>>>>
>>>> And the other thing you're implicitly asking the community for is to
>>>> prioritize this feature for continued review and maintenance. There is
>>>> already work to be done on things like making barrier mode support
>>>> dynamic allocation (SPARK-24942), bugs in failure handling (e.g.,
>>>> SPARK-25250), and the general efficiency of failure handling (e.g.,
>>>> SPARK-25341, SPARK-20178). I'm very concerned about getting spread too
>>>> thin.
>>>>
>>>>
>>>> But if this is really just a vote on (1) is better GPU support
>>>> important for Spark, in some form, in some release? and (2) is it
>>>> *possible* to do this in a safe way? then I will vote +0.
>>>>
>>>> On Tue, Mar 5, 2019 at 8:25 AM Tom Graves <tgraves...@yahoo.com> wrote:
>>>>
>>>> So to me, most of the questions here are implementation/design
>>>> questions. I've had this issue in the past with SPIPs, where I expected
>>>> more high-level design details but was basically told those belong in
>>>> the follow-on design JIRA. This makes me think we need to revisit what
>>>> an SPIP really needs to contain, which should be done in a separate
>>>> thread. Personally, I would be for having more high-level details in
>>>> it. But the way I read our documentation on SPIPs right now, that
>>>> detail is all optional. Maybe we could argue it depends on what
>>>> reviewers request, but perhaps we should make the wording more clearly
>>>> required. Thoughts? We should probably split that off into its own
>>>> discussion if people want to talk about it.
>>>>
>>>> For this SPIP in particular, the reason I +1'd it is that it came down
>>>> to two questions:
>>>>
>>>> 1) Do I think Spark should support this? My answer is yes; I think this
>>>> would improve Spark. Users have been requesting both better GPU support
>>>> and finer-grained control over container requests for a while. If Spark
>>>> doesn't support this, then users may go to something else, so I think
>>>> we should support it.
>>>>
>>>> 2) Do I think it's possible to design and implement it without causing
>>>> large instabilities? My opinion here again is yes. I agree with Imran
>>>> and others that the scheduler piece needs to be looked at very closely,
>>>> as we have had a lot of issues there; that is why I was asking for more
>>>> details in the design JIRA:
>>>> https://issues.apache.org/jira/browse/SPARK-27005. But I do believe
>>>> it's possible to do.
>>>>
>>>> If others have reservations about similar questions, then I think we
>>>> should resolve them here, or take the discussion of what an SPIP is to
>>>> a different thread and then come back to this. Thoughts?
>>>>
>>>> Note there is already a high-level design for at least the core piece,
>>>> which is what people seem most concerned with, so including it in the
>>>> SPIP should be straightforward.
>>>>
>>>> Tom
>>>>
>>>> On Monday, March 4, 2019, 2:52:43 PM CST, Imran Rashid <
>>>> im...@therashids.com> wrote:
>>>>
>>>>
>>>> On Sun, Mar 3, 2019 at 6:51 PM Xiangrui Meng <men...@gmail.com> wrote:
>>>>
>>>> On Sun, Mar 3, 2019 at 10:20 AM Felix Cheung <felixcheun...@hotmail.com>
>>>> wrote:
>>>>
>>>> IMO upfront allocation is less useful. Specifically too expensive for
>>>> large jobs.
>>>>
>>>>
>>>> This is also an API/design discussion.
>>>>
>>>>
>>>> I agree with Felix -- this is more than just an API question. It has a
>>>> huge impact on the complexity of what you're proposing. You might be
>>>> proposing big changes to a core and brittle part of Spark, which is
>>>> already short of experts.
>>>>
>>>> I don't see any value in having a vote on "does feature X sound cool?"
>>>> We have to evaluate the potential benefit against the risks the feature
>>>> brings and the continued maintenance cost. We don't need super
>>>> low-level details, but we do need a sketch of the design to be able to
>>>> make that tradeoff.
>>>>
>>>>
