+1 for leveraging `requestResources`. I've also toyed with this idea offline, in the context of allocator groups. IMO, giving schedulers a way to specify resource envelope size and/or constraints is an easier way to manage these resources.
On Thu, Jun 16, 2016 at 9:39 AM, Alex Rukletsov <a...@mesosphere.com> wrote:

> We definitely don't want a 2-step scenario. In this case, a framework may
> not be able to launch its tasks on GPU resources, while still holding them.
>
> However, having a dedicated sorter for scarce resources does not mean we
> should allocate them separately. Also, I'm not sure Guangya intended to
> enumerate allocation stages; it looks like he simply listed the sorters. I
> don't see why we would want to allocate scarce resources after allocating
> revocable ones.
>
> I think a more important question is how to effectively offer scarce
> resources to frameworks that are interested in them. Maybe we can leverage
> `requestResources`?
>
> On Thu, Jun 16, 2016 at 5:22 PM, Hans van den Bogert <hansbog...@gmail.com>
> wrote:
>
> > Hi all,
> >
> > Maybe I'm missing context on how something like a GPU as a resource
> > should work, but I assume that the general scenario would be that the
> > GPU host application still needs memory and cpu(s) co-located on the
> > node. In the case of,
> >
> > > 4) scarceSorter include 1 agent with (gpus:1)
> >
> > If I understand your meaning correctly, this round of offers would, in
> > this case, consist only of the GPU resource. Is it then up to the
> > framework to figure out that it will also need cpu and memory on the
> > agent's node? If so, it would need at least one more offer for that node
> > to make the GPU resource useful. Such a 2-step offer/accept is rather
> > cumbersome.
> >
> > Regards,
> >
> > Hans
> >
> > > On Jun 16, 2016, at 11:26 AM, Guangya Liu <gyliu...@gmail.com> wrote:
> > >
> > > Hi Ben,
> > >
> > > The pre-condition for the four-stage allocation is that we need to put
> > > different resources into different sorters:
> > >
> > > 1) roleSorter only includes non-scarce resources.
> > > 2) quotaRoleSorter only includes non-revocable & non-scarce resources.
> > > 3) revocableSorter only includes revocable & non-scarce resources.
> > > This will be handled in MESOS-4923
> > > <https://issues.apache.org/jira/browse/MESOS-4923>.
> > > 4) scarceSorter only includes scarce resources.
> > >
> > > Take your case above:
> > > 999 agents with (cpus:4,mem:1024,disk:1024)
> > > 1 agent with (gpus:1,cpus:4,mem:1024,disk:1024)
> > >
> > > The four sorters would be:
> > > 1) roleSorter includes 1000 agents with (cpus:4,mem:1024,disk:1024)
> > > 2) quotaRoleSorter includes 1000 agents with (cpus:4,mem:1024,disk:1024)
> > > 3) revocableSorter includes nothing, as there are no revocable
> > > resources here.
> > > 4) scarceSorter includes 1 agent with (gpus:1)
> > >
> > > When allocating resources, even if a role gets the agent with GPU
> > > resources, its share will only be counted by the scarceSorter, and
> > > will not impact the other sorters.
> > >
> > > The above solution is actually a kind of enhancement to "exclude
> > > scarce resources", since the scarce resources still obey the DRF
> > > algorithm with this.
> > >
> > > This solution can also be seen as logically dividing the whole
> > > resource pool into scarce and non-scarce pools: 1), 2), and 3) handle
> > > non-scarce resources, while 4) focuses on scarce resources.
> > >
> > > Thanks,
> > >
> > > Guangya
> > >
> > > On Thu, Jun 16, 2016 at 2:10 AM, Benjamin Mahler <bmah...@apache.org>
> > > wrote:
> > >
> > >> Hm.. can you expand on how adding another allocation stage for only
> > >> scarce resources would behave well? It seems to have a number of
> > >> problems when I think through it.
> > >>
> > >> On Sat, Jun 11, 2016 at 7:59 AM, Guangya Liu <gyliu...@gmail.com>
> > >> wrote:
> > >>
> > >>> Hi Ben,
> > >>>
> > >>> For the long-term goal, instead of creating sub-pools, what about
> > >>> adding a new sorter to handle *scarce* resources? The current logic
> > >>> in the allocator is divided into two stages: allocation for quota,
> > >>> and allocation for non-quota resources.
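The dominant-share arithmetic behind this thread's 999+1 agent example can be sketched numerically. The following is a hypothetical illustration, not Mesos code: the function, the dict-based resource model, and the totals are all assumptions made for this sketch. It shows why holding the cluster's only GPU drives a role's DRF share to 100%, and how an exclusion list for "gpus" changes that.

```python
# Hypothetical sketch (not Mesos code): DRF dominant share with and
# without a scarce-resource exclusion list.

def dominant_share(allocation, totals, excluded=()):
    """Return the DRF dominant share: the maximum fractional share
    across resource kinds, optionally ignoring excluded kinds."""
    shares = [
        allocation.get(kind, 0) / total
        for kind, total in totals.items()
        if kind not in excluded and total > 0
    ]
    return max(shares, default=0.0)

# Cluster totals from the thread's example:
# 999 agents with (cpus:4,mem:1024) plus 1 agent with (gpus:1,cpus:4,mem:1024).
totals = {"cpus": 4000, "mem": 1024000, "gpus": 1}

# A role holding the single GPU agent's gpu plus a sliver of cpu/mem.
allocation = {"cpus": 4, "mem": 1024, "gpus": 1}

print(dominant_share(allocation, totals))                     # 1.0 -- DRF sees 100% of the cluster
print(dominant_share(allocation, totals, excluded={"gpus"}))  # 0.001 -- with "gpus" excluded
```

With "gpus" excluded from the dominant share, the role's share falls back to its cpu/mem fraction, so it keeps receiving offers for the rest of the cluster.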
> > >>>
> > >>> I think that the future logic in the allocator would be divided
> > >>> into four stages:
> > >>> 1) allocation for quota
> > >>> 2) allocation for reserved resources
> > >>> 3) allocation for revocable resources
> > >>> 4) allocation for scarce resources
> > >>>
> > >>> Thanks,
> > >>>
> > >>> Guangya
> > >>>
> > >>> On Sat, Jun 11, 2016 at 10:50 AM, Benjamin Mahler
> > >>> <bmah...@apache.org> wrote:
> > >>>
> > >>>> I wanted to start a discussion about the allocation of "scarce"
> > >>>> resources. "Scarce" in this context means resources that are not
> > >>>> present on every machine. GPUs are the first example of a scarce
> > >>>> resource that we support as a known resource type.
> > >>>>
> > >>>> Consider the behavior when there are the following agents in a
> > >>>> cluster:
> > >>>>
> > >>>> 999 agents with (cpus:4,mem:1024,disk:1024)
> > >>>> 1 agent with (gpus:1,cpus:4,mem:1024,disk:1024)
> > >>>>
> > >>>> Here there are 1000 machines but only 1 has GPUs. We call GPUs a
> > >>>> "scarce" resource here because they are only present on a small
> > >>>> percentage of the machines.
> > >>>>
> > >>>> We end up with some problematic behavior here with our current
> > >>>> allocation model:
> > >>>>
> > >>>>   (1) If a role wishes to use both GPU and non-GPU resources for
> > >>>> tasks, consuming 1 GPU will lead DRF to consider the role to have
> > >>>> a 100% share of the cluster, since it consumes 100% of the GPUs in
> > >>>> the cluster. This framework will then not receive any other offers.
> > >>>>
> > >>>>   (2) Because we do not have revocation yet, if a framework
> > >>>> decides to consume the non-GPU resources on a GPU machine, it will
> > >>>> prevent the GPU workloads from running!
> > >>>>
> > >>>> --------
> > >>>>
> > >>>> I filed an epic [1] to track this. The plan for the short term is
> > >>>> to introduce two mechanisms to mitigate these issues:
> > >>>>
> > >>>>   - Introduce a resource fairness exclusion list.
> > >>>> This allows the shares of resources like "gpus" to be excluded
> > >>>> from the dominant share.
> > >>>>
> > >>>>   - Introduce a GPU_AWARE framework capability. This indicates
> > >>>> that the scheduler is aware of GPUs and will schedule tasks
> > >>>> accordingly. Old schedulers will not have the capability and will
> > >>>> not receive any offers for GPU machines. If a scheduler has the
> > >>>> capability, we'll advise that they avoid placing their additional
> > >>>> non-GPU workloads on the GPU machines.
> > >>>>
> > >>>> --------
> > >>>>
> > >>>> Longer term, we'll want a more robust way to manage scarce
> > >>>> resources. The first thought we had was to have sub-pools of
> > >>>> resources based on machine profile and perform fair sharing /
> > >>>> quota within each pool. This addresses (1) cleanly, and for (2)
> > >>>> the operator needs to explicitly disallow non-GPU frameworks from
> > >>>> participating in the GPU pool.
> > >>>>
> > >>>> Unfortunately, by excluding non-GPU frameworks from the GPU pool
> > >>>> we may have a lower level of utilization. In the even longer term,
> > >>>> as we add revocation it will become possible to allow a scheduler
> > >>>> desiring GPUs to revoke the resources allocated to the non-GPU
> > >>>> workloads running on the GPU machines. There are a number of
> > >>>> things we need to put in place to support revocation ([2], [3],
> > >>>> [4], etc.), so I'm glossing over the details here.
> > >>>>
> > >>>> If anyone has any thoughts or insight in this area, please share!
> > >>>>
> > >>>> Ben
> > >>>>
> > >>>> [1] https://issues.apache.org/jira/browse/MESOS-5377
> > >>>> [2] https://issues.apache.org/jira/browse/MESOS-5524
> > >>>> [3] https://issues.apache.org/jira/browse/MESOS-5527
> > >>>> [4] https://issues.apache.org/jira/browse/MESOS-4392

-- 
Cheers,
Zhitao Li
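The scarce/non-scarce split that runs through this thread, whether realized as a dedicated scarceSorter or as sub-pools, boils down to partitioning each agent's resources by kind. The sketch below is a hypothetical illustration only: the `SCARCE` set, the dict-based resource model, and `split_scarce` are assumptions for this sketch, and Mesos's actual `Resources` and sorter types are far richer.

```python
# Hypothetical sketch (not Mesos code) of partitioning an agent's
# resources into the non-scarce pool (roleSorter/quotaRoleSorter input)
# and the scarce pool (scarceSorter input), per the thread's proposal.

SCARCE = {"gpus"}  # assumed operator-configured set of scarce kinds

def split_scarce(agent_resources):
    """Partition one agent's resources into (non_scarce, scarce) pools."""
    non_scarce = {k: v for k, v in agent_resources.items() if k not in SCARCE}
    scarce = {k: v for k, v in agent_resources.items() if k in SCARCE}
    return non_scarce, scarce

# The thread's GPU agent: its cpu/mem/disk stay in the ordinary DRF pool,
# while only the gpu lands in the scarce pool.
gpu_agent = {"gpus": 1, "cpus": 4, "mem": 1024, "disk": 1024}
non_scarce, scarce = split_scarce(gpu_agent)
print(non_scarce)  # {'cpus': 4, 'mem': 1024, 'disk': 1024}
print(scarce)      # {'gpus': 1}
```

Under this split, a role's consumption of the lone GPU only moves its position in the scarce pool's ordering, leaving its fair share of cpu/mem/disk in the main pool untouched, which is exactly the property Guangya argues for above.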