Thanks all for the input here!

@Hans van den Bogert,

Yes, I agree with Alex R: Mesos currently uses coarse-grained allocation, and
the minimum unit is a single host, so you will always get cpus and memory.


Yes, I was only listing sorters here. Ideally, I think the allocation
sequence should be:

1) Allocate quota non-scarce resources
2) Allocate quota scarce resources
3) Allocate reserved non-scarce resources
4) Allocate reserved scarce resources
5) Allocate revocable non-scarce resources
6) Allocate revocable scarce resources
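To make the ordering concrete, here is a rough Python sketch of the six
stages above (the stage tags and resource fields are hypothetical and purely
illustrative, not actual Mesos allocator code):

```python
# Illustrative sketch only: walk resources in the proposed six-stage order,
# allocating non-scarce before scarce within each kind.

STAGES = [
    ("quota", False), ("quota", True),
    ("reserved", False), ("reserved", True),
    ("revocable", False), ("revocable", True),
]

def allocation_order(resources):
    """Yield ((kind, scarce), batch) pairs in the proposed stage order."""
    for kind, scarce in STAGES:
        batch = [r for r in resources
                 if r["kind"] == kind and r["scarce"] == scarce]
        if batch:
            yield (kind, scarce), batch

resources = [
    {"name": "cpus", "kind": "quota", "scarce": False},
    {"name": "gpus", "kind": "quota", "scarce": True},
    {"name": "disk", "kind": "reserved", "scarce": False},
]
for stage, batch in allocation_order(resources):
    print(stage, [r["name"] for r in batch])
```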

Regarding "requestResources": I think that even if we implement it, scarce
resources will still impact the WDRF sorter, as Ben M pointed out in his use
cases.

An ideal solution would be "exclude scarce resources from sorter" plus
"requestResources" for scarce resources: the former focuses on non-scarce
resources, while "requestResources" focuses on scarce resources.
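To illustrate why excluding scarce resources from the sorter helps, here is
a toy Python sketch of a DRF dominant-share computation with an exclusion
list, using Ben M's 1000-agent example below (the function and names are
hypothetical, not the Mesos sorter API):

```python
# Cluster totals for 999 agents of (cpus:4,mem:1024) plus one GPU agent.
TOTALS = {"cpus": 4000.0, "mem": 1024000.0, "gpus": 1.0}

def dominant_share(allocation, totals, excluded=()):
    """Return the max per-resource share, skipping excluded resources."""
    shares = [allocation.get(r, 0.0) / total
              for r, total in totals.items() if r not in excluded]
    return max(shares, default=0.0)

# A role holding one whole GPU agent's resources.
alloc = {"cpus": 4.0, "mem": 1024.0, "gpus": 1.0}

# Without exclusion, the single GPU dominates: share == 1.0 (100%),
# so the role would receive no further offers.
print(dominant_share(alloc, TOTALS))            # 1.0
# Excluding "gpus" leaves the cpu/mem share: 4 / 4000 == 0.001.
print(dominant_share(alloc, TOTALS, {"gpus"}))  # 0.001
```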

I can see that so far we have three solutions for handling scarce resources:
1) Ben M: Create sub-pools of resources based on machine profile and
perform fair sharing / quota within each pool, plus a GPU_AWARE framework
capability that lets the allocator filter out scarce resources for some
frameworks.
2) Guangya: Add new sorters for non-scarce resources, plus a GPU_AWARE
framework capability that lets the allocator filter out scarce resources for
some frameworks.
3) Alex R: "requestResources" for scarce resources plus "exclude scarce
resources from sorter" for non-scarce resources. (@Alex R, I added "exclude
scarce resources from sorter" to your proposal, hope that is OK?)

Solution 1) may cause low resource utilization, as Ben M pointed out. Both
2) and 3) still keep all resources in a single pool, so resource utilization
will not be impacted.

Between solutions 2) and 3), I do not have a strong preference. My only
concern with 2) is that many sorters might cause performance issues, but
since we can assume there are not many scarce resource types in a cluster,
the performance impact should be small even if we add another three sorters
for scarce resources.

For solution 3), the only problem with "requestResources" is that it may
lead to "greedy frameworks" consuming all resources. We may want to consider
enabling "requestResources" only for scarce resources at first, so as to
reduce the impact of greedy frameworks.
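As a rough illustration of restricting a "requestResources"-style call to
scarce resources only, here is a toy validation sketch (the names and the
validation function are hypothetical, not the Mesos API):

```python
# Illustrative sketch: only resources the operator marked scarce may be
# requested; everything else must still flow through normal offers.
SCARCE = {"gpus"}

def validate_request(request):
    """Reject requests that ask for any non-scarce resource."""
    bad = sorted(name for name in request if name not in SCARCE)
    if bad:
        raise ValueError(f"non-scarce resources not requestable: {bad}")
    return request

validate_request({"gpus": 2})                 # accepted
try:
    validate_request({"cpus": 8, "gpus": 1})  # rejected: cpus not scarce
except ValueError as e:
    print(e)
```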

Another problem with solutions 1) and 2) is that whenever a new scarce
resource appears, we would need to introduce a framework capability for that
specific resource so the allocator can filter it out. But I think this will
not have much impact, since by definition there should not be many scarce
resource types in the future.

@Fan Du,

Currently, I think scarce resources should be defined by the cluster admin;
s/he can specify them via a flag when the master starts.

Regarding the proposal of generic scarce resources: do you have any further
thoughts on this? I can see that giving framework developers the option to
define scarce resources may bring trouble to Mesos; it is better to let
Mesos define them rather than framework developers.



On Fri, Jun 17, 2016 at 6:53 AM, Joris Van Remoortere <>

> @Fan,
> In the community meeting a question was raised around which frameworks
> might be ready to use this.
> Can you provide some more context for immediate use cases on the framework
> side?
> —
> *Joris Van Remoortere*
> Mesosphere
> On Fri, Jun 17, 2016 at 12:51 AM, Du, Fan <> wrote:
> > A couple of rough thoughts in the early morning:
> >
> > a. Is there any quantitative way to decide a resource is kind of scarce?
> > I mean, how to aid the operator to make the decision to use/not use this
> > functionality when deploying mesos.
> >
> > b. Scarce resources extend from GPU to, to name a few, Xeon Phi and
> > FPGA; what about making the proposal more generic and future-proof?
> >
> >
> >
> > On 2016/6/11 10:50, Benjamin Mahler wrote:
> >
> >> I wanted to start a discussion about the allocation of "scarce"
> resources.
> >> "Scarce" in this context means resources that are not present on every
> >> machine. GPUs are the first example of a scarce resource that we support
> >> as
> >> a known resource type.
> >>
> >> Consider the behavior when there are the following agents in a cluster:
> >>
> >> 999 agents with (cpus:4,mem:1024,disk:1024)
> >> 1 agent with (gpus:1,cpus:4,mem:1024,disk:1024)
> >>
> >> Here there are 1000 machines but only 1 has GPUs. We call GPUs a
> "scarce"
> >> resource here because they are only present on a small percentage of the
> >> machines.
> >>
> >> We end up with some problematic behavior here with our current
> allocation
> >> model:
> >>
> >>      (1) If a role wishes to use both GPU and non-GPU resources for
> tasks,
> >> consuming 1 GPU will lead DRF to consider the role to have a 100% share
> of
> >> the cluster, since it consumes 100% of the GPUs in the cluster. This
> >> framework will then not receive any other offers.
> >>
> >>      (2) Because we do not have revocation yet, if a framework decides
> to
> >> consume the non-GPU resources on a GPU machine, it will prevent the GPU
> >> workloads from running!
> >>
> >> --------
> >>
> >> I filed an epic [1] to track this. The plan for the short-term is to
> >> introduce two mechanisms to mitigate these issues:
> >>
> >>      -Introduce a resource fairness exclusion list. This allows the
> shares
> >> of resources like "gpus" to be excluded from the dominant share.
> >>
> >>      -Introduce a GPU_AWARE framework capability. This indicates that
> the
> >> scheduler is aware of GPUs and will schedule tasks accordingly. Old
> >> schedulers will not have the capability and will not receive any offers
> >> for
> >> GPU machines. If a scheduler has the capability, we'll advise that they
> >> avoid placing their additional non-GPU workloads on the GPU machines.
> >>
> >> --------
> >>
> >> Longer term, we'll want a more robust way to manage scarce resources.
> The
> >> first thought we had was to have sub-pools of resources based on machine
> >> profile and perform fair sharing / quota within each pool. This
> addresses
> >> (1) cleanly, and for (2) the operator needs to explicitly disallow
> non-GPU
> >> frameworks from participating in the GPU pool.
> >>
> >> Unfortunately, by excluding non-GPU frameworks from the GPU pool we may
> >> have a lower level of utilization. In the even longer term, as we add
> >> revocation it will be possible to allow a scheduler desiring GPUs to
> >> revoke
> >> the resources allocated to the non-GPU workloads running on the GPU
> >> machines. There are a number of things we need to put in place to
> support
> >> revocation ([2], [3], [4], etc), so I'm glossing over the details here.
> >>
> >> If anyone has any thoughts or insight in this area, please share!
> >>
> >> Ben
> >>
> >> [1]
> >> [2]
> >> [3]
> >> [4]
> >>
> >>
