Thanks all for the input here! @Hans van den Bogert,
Yes, I agree with Alex R: Mesos currently uses coarse-grained mode to allocate resources, and the minimum allocation unit is a single host, so you will always get CPU and memory.

@Alex, yes, I was only listing sorters there. Ideally, I think the allocation sequence should be:

1) Allocate quota non-scarce resources
2) Allocate quota scarce resources
3) Allocate reserved non-scarce resources
4) Allocate reserved scarce resources
5) Allocate revocable non-scarce resources
6) Allocate revocable scarce resources

Regarding "requestResources": even if we implement it, scarce resources will still impact the WDRF sorter, as Ben M pointed out in his use cases. An ideal solution would be "exclude scarce resources from the sorter" plus "requestResources" for scarce resources: the former focuses on non-scarce resources while the latter focuses on scarce resources.

So far we have three proposed solutions for handling scarce resources:

1) Ben M: Create sub-pools of resources based on machine profile and perform fair sharing / quota within each pool, plus a GPU_AWARE framework capability so the allocator can filter out scarce resources for frameworks that lack it.
2) Guangya: Add new sorters for non-scarce resources, plus a GPU_AWARE framework capability so the allocator can filter out scarce resources for frameworks that lack it.
3) Alex R: "requestResources" for scarce resources, plus "exclude scarce resources from the sorter" for non-scarce resources. (@Alex R, I added "exclude scarce resources from the sorter" to your proposal, hope that is OK?)

Solution 1) may cause low resource utilization, as Ben M pointed out. Both 2) and 3) keep all resources in a single pool, so resource utilization will not be affected. Between 2) and 3), I do not have a strong opinion about which is better.
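As an aside, the six-stage ordering above can be sketched in a few lines. This is illustrative Python only, not Mesos code; the names (SCARCE, split_scarce, allocation_stages) and the flat {name: amount} resource maps are assumptions for the sketch:

```python
# Hypothetical sketch of the proposed six-stage allocation order.
# SCARCE is assumed to come from an operator-configured master flag.
SCARCE = {"gpus"}

def split_scarce(resources):
    """Partition a {name: amount} resource map into (non-scarce, scarce)."""
    non_scarce = {n: a for n, a in resources.items() if n not in SCARCE}
    scarce = {n: a for n, a in resources.items() if n in SCARCE}
    return non_scarce, scarce

def allocation_stages():
    """Yield (allocation class, scarcity) pairs in the proposed order:
    quota before reserved before revocable, non-scarce before scarce."""
    for cls in ("quota", "reserved", "revocable"):
        for scarcity in ("non-scarce", "scarce"):
            yield cls, scarcity

# Matches steps 1) through 6) in the text:
for stage in allocation_stages():
    print(stage)
```

The point of the ordering is only that non-scarce resources in each class are allocated before the scarce ones, so the common case is never blocked behind the rare one.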
For 2), my only concern is that having many sorters could cause performance issues; but since we can assume a cluster does not have many scarce resource types, performance should not suffer much even if we add another three sorters for scarce resources.

For 3), the only problem with "requestResources" is that it may allow a "greedy framework" to consume all resources. We may want to enable "requestResources" for scarce resources only at first, to reduce the impact of such greedy frameworks.

Another problem with solutions 1) and 2) is that we need to introduce a framework capability for each scarce resource type, so that the allocator can filter it out whenever a new scarce resource appears. I think this will not hurt much, though, since we should not have many scarce resource types in the future, precisely because they are "scarce".

@Fan Du, currently I think scarce resources should be defined by the cluster admin, who can specify them via a flag when the master starts up. Regarding the proposal for generic scarce resources, do you have any thoughts? I can see that giving framework developers the option of defining scarce resources may bring trouble to Mesos; it is better for Mesos to define them rather than framework developers.

Thanks,
Guangya

On Fri, Jun 17, 2016 at 6:53 AM, Joris Van Remoortere <jo...@mesosphere.io> wrote:

> @Fan,
>
> In the community meeting a question was raised around which frameworks
> might be ready to use this.
> Can you provide some more context for immediate use cases on the framework
> side?
>
> —
> *Joris Van Remoortere*
> Mesosphere
>
> On Fri, Jun 17, 2016 at 12:51 AM, Du, Fan <fan...@intel.com> wrote:
>
> > A couple of rough thoughts in the early morning:
> >
> > a. Is there any quantitative way to decide a resource is kind of scarce?
> > I mean, how to aid the operator in making the decision to use/not use
> > this functionality when deploying Mesos.
> >
> > b.
> > Scarce resources extend from GPU to, to name a few, Xeon Phi and FPGA;
> > what about making the proposal more generic and future proof?
>
> On 2016/6/11 10:50, Benjamin Mahler wrote:
>
> >> I wanted to start a discussion about the allocation of "scarce" resources.
> >> "Scarce" in this context means resources that are not present on every
> >> machine. GPUs are the first example of a scarce resource that we support
> >> as a known resource type.
> >>
> >> Consider the behavior when there are the following agents in a cluster:
> >>
> >> 999 agents with (cpus:4,mem:1024,disk:1024)
> >> 1 agent with (gpus:1,cpus:4,mem:1024,disk:1024)
> >>
> >> Here there are 1000 machines but only 1 has GPUs. We call GPUs a "scarce"
> >> resource here because they are only present on a small percentage of the
> >> machines.
> >>
> >> We end up with some problematic behavior here with our current allocation
> >> model:
> >>
> >> (1) If a role wishes to use both GPU and non-GPU resources for tasks,
> >> consuming 1 GPU will lead DRF to consider the role to have a 100% share
> >> of the cluster, since it consumes 100% of the GPUs in the cluster. This
> >> framework will then not receive any other offers.
> >>
> >> (2) Because we do not have revocation yet, if a framework decides to
> >> consume the non-GPU resources on a GPU machine, it will prevent the GPU
> >> workloads from running!
> >>
> >> --------
> >>
> >> I filed an epic [1] to track this. The plan for the short-term is to
> >> introduce two mechanisms to mitigate these issues:
> >>
> >> - Introduce a resource fairness exclusion list. This allows the shares
> >> of resources like "gpus" to be excluded from the dominant share.
> >>
> >> - Introduce a GPU_AWARE framework capability. This indicates that the
> >> scheduler is aware of GPUs and will schedule tasks accordingly. Old
> >> schedulers will not have the capability and will not receive any offers
> >> for GPU machines.
> >> If a scheduler has the capability, we'll advise that they avoid placing
> >> their additional non-GPU workloads on the GPU machines.
> >>
> >> --------
> >>
> >> Longer term, we'll want a more robust way to manage scarce resources.
> >> The first thought we had was to have sub-pools of resources based on
> >> machine profile and perform fair sharing / quota within each pool. This
> >> addresses (1) cleanly, and for (2) the operator needs to explicitly
> >> disallow non-GPU frameworks from participating in the GPU pool.
> >>
> >> Unfortunately, by excluding non-GPU frameworks from the GPU pool we may
> >> have a lower level of utilization. In the even longer term, as we add
> >> revocation it will be possible to allow a scheduler desiring GPUs to
> >> revoke the resources allocated to the non-GPU workloads running on the
> >> GPU machines. There are a number of things we need to put in place to
> >> support revocation ([2], [3], [4], etc), so I'm glossing over the
> >> details here.
> >>
> >> If anyone has any thoughts or insight in this area, please share!
> >>
> >> Ben
> >>
> >> [1] https://issues.apache.org/jira/browse/MESOS-5377
> >> [2] https://issues.apache.org/jira/browse/MESOS-5524
> >> [3] https://issues.apache.org/jira/browse/MESOS-5527
> >> [4] https://issues.apache.org/jira/browse/MESOS-4392
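
To make issue (1) from Ben's mail concrete: a minimal sketch (plain Python, not Mesos code; the function name and resource maps are made up for illustration) of the DRF dominant share, with and without the proposed exclusion list, using his 999-agent + 1-GPU-agent example:

```python
# Sketch of DRF dominant share: the maximum, over resources, of the
# fraction of the cluster total a role has been allocated. The `excluded`
# set models the proposed resource fairness exclusion list.
def dominant_share(allocated, total, excluded=frozenset()):
    shares = [allocated.get(name, 0) / amount
              for name, amount in total.items()
              if name not in excluded and amount > 0]
    return max(shares, default=0.0)

# Ben's cluster: 999 agents of cpus:4,mem:1024 plus 1 agent that also
# has gpus:1 (disk omitted for brevity).
total = {"cpus": 4000, "mem": 1024 * 1000, "gpus": 1}

# A role runs one task on the GPU machine, consuming the single GPU.
alloc = {"cpus": 4, "mem": 1024, "gpus": 1}

print(dominant_share(alloc, total))                     # 1.0  -- role appears to own the whole cluster
print(dominant_share(alloc, total, excluded={"gpus"}))  # 0.001 -- fair view of its non-scarce usage
```

With GPUs counted, one tiny task drives the role's share to 100% and it stops receiving offers; with "gpus" excluded from the dominant share, the same role sits at 0.1% and keeps participating in fair sharing of the plentiful resources.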