Thanks, Ben M., for taking the initiative.

You have already defined scarce resources and described the problem clearly.
To recap, the requirements we want to address are as follows:

1) A scarce resource should not be treated as the dominant resource during
allocation, so that a user can continue to use non-scarce resources no
matter how many scarce resources he/she has used.
2) Hosts with scarce resources should be allocated first (or only) to
frameworks that need scarce resources, so that those frameworks will not be

However, this also introduces new problems, for example:

How to enforce "fairness" among frameworks when allocating scarce resources
How to improve resource utilization when the cluster is partitioned by
scarce resources

Going back to the GPU use case, I agree with the short-term solutions:

a) Exclude GPUs from the dominant resource calculation.
b) Allow a framework to declare a scarce resource, for example GPU, as a
capability when it registers, so that Mesos only sends offers containing
scarce resources to those frameworks. The capability should not be
hard-coded to handle only GPUs.
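
To make a) and b) concrete, here is a minimal sketch in Python. It is not
Mesos code: the function names, the "capabilities" field, and the framework
names are hypothetical, and the cluster totals are taken from the
1000-agent example in Ben's mail.

```python
from fractions import Fraction

# Cluster totals from the quoted example: 999 plain agents plus 1 GPU agent,
# each with cpus:4, mem:1024, disk:1024.
TOTAL = {"cpus": 4000, "mem": 1024000, "disk": 1024000, "gpus": 1}


def dominant_share(allocated, total, excluded=frozenset()):
    """Return the largest per-resource share, skipping excluded resources."""
    shares = [
        Fraction(amount, total[name])
        for name, amount in allocated.items()
        if name not in excluded and total.get(name, 0) > 0
    ]
    return max(shares, default=Fraction(0))


def frameworks_to_offer(agent, frameworks, scarce=frozenset({"gpus"})):
    """Return the names of frameworks allowed to see an offer for `agent`.

    An agent holding a scarce resource is offered only to frameworks that
    declared that resource as a capability (hypothetical field name).
    """
    needed = {name for name in scarce if agent.get(name, 0) > 0}
    return [fw["name"] for fw in frameworks if needed <= fw["capabilities"]]


# A role that runs one task on the single GPU agent.
role = {"gpus": 1, "cpus": 4, "mem": 1024}
share_with_gpu = dominant_share(role, TOTAL)
share_without_gpu = dominant_share(role, TOTAL, excluded={"gpus"})

# Two hypothetical frameworks: one declared the "gpus" capability, one did not.
frameworks = [
    {"name": "ml", "capabilities": {"gpus"}},
    {"name": "web", "capabilities": set()},
]
gpu_agent = {"gpus": 1, "cpus": 4, "mem": 1024, "disk": 1024}
plain_agent = {"cpus": 4, "mem": 1024, "disk": 1024}
```

With "gpus" counted, the role's dominant share is 1 (the whole cluster); with
"gpus" excluded it drops to 1/1000, so the role keeps receiving offers for
non-GPU resources. The GPU agent is offered only to "ml", while plain agents
are offered to both frameworks.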

For the long term, since scarce resources and even host attributes already
partition the cluster into resource pools, we should have a solution for
the newly introduced problems mentioned above:

1) Mesos will also need to enforce fairness when allocating scarce resources
(or hosts with a special attribute).
2) Mesos should allow frameworks that do not ask for scarce resources (or
hosts with a special attribute) to use hosts with scarce resources (or a
special attribute) if no framework is asking for them.
3) Mesos should allow frameworks that ask for scarce resources to preempt
those resources back.
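
Points 1) to 3) above could be sketched roughly as follows. This is only an
illustration of the policy under simplified assumptions, not a proposal for
the allocator's actual interface: the function names and the fields
"wants_scarce", "scarce_used", and "share" are hypothetical.

```python
def choose_framework(agent, frameworks, scarce="gpus"):
    """Pick the framework that should receive the next offer for `agent`."""
    if not frameworks:
        return None
    if agent.get(scarce, 0) > 0:
        wanting = [fw for fw in frameworks if fw["wants_scarce"]]
        if wanting:
            # Point 1: be fair within the scarce pool by favoring the
            # framework that currently holds the least of the scarce resource.
            return min(wanting, key=lambda fw: fw["scarce_used"])["name"]
        # Point 2: nobody asks for the scarce resource, so fall through and
        # let any framework use the host.
    # Hosts without the scarce resource (or the fallback case) go to the
    # framework with the lowest ordinary share.
    return min(frameworks, key=lambda fw: fw["share"])["name"]


def tasks_to_preempt(agent_tasks, scarce_framework_names):
    """Point 3: list tasks on a scarce host that belong to frameworks which
    did not ask for the scarce resource; these are the revocation candidates."""
    return [
        task["id"]
        for task in agent_tasks
        if task["framework"] not in scarce_framework_names
    ]


# Hypothetical frameworks and agents for illustration.
frameworks = [
    {"name": "ml-a", "wants_scarce": True, "scarce_used": 2, "share": 0.5},
    {"name": "ml-b", "wants_scarce": True, "scarce_used": 0, "share": 0.7},
    {"name": "web", "wants_scarce": False, "scarce_used": 0, "share": 0.1},
]
gpu_agent = {"gpus": 1, "cpus": 4}
plain_agent = {"cpus": 4}
running = [
    {"id": "t1", "framework": "web"},
    {"id": "t2", "framework": "ml-a"},
]
```

Here the GPU host goes to "ml-b" (it holds the fewest GPUs), a plain host goes
to "web" (lowest share), and if only "web" were registered it could still use
the GPU host. When a GPU framework later needs the host back, "t1" is the
task that would be revoked.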

YARN's node label feature in the upcoming 2.8 release tries to address
similar use cases (GPUs and so on). Even though YARN's allocation model is
request-based, we can still learn from its experience.



On Fri, Jun 10, 2016 at 10:50 PM, Benjamin Mahler wrote:

> I wanted to start a discussion about the allocation of "scarce" resources.
> "Scarce" in this context means resources that are not present on every
> machine. GPUs are the first example of a scarce resource that we support as
> a known resource type.
> Consider the behavior when there are the following agents in a cluster:
> 999 agents with (cpus:4,mem:1024,disk:1024)
> 1 agent with (gpus:1,cpus:4,mem:1024,disk:1024)
> Here there are 1000 machines but only 1 has GPUs. We call GPUs a "scarce"
> resource here because they are only present on a small percentage of the
> machines.
> We end up with some problematic behavior here with our current allocation
> model:
>     (1) If a role wishes to use both GPU and non-GPU resources for tasks,
> consuming 1 GPU will lead DRF to consider the role to have a 100% share of
> the cluster, since it consumes 100% of the GPUs in the cluster. This
> framework will then not receive any other offers.
>     (2) Because we do not have revocation yet, if a framework decides to
> consume the non-GPU resources on a GPU machine, it will prevent the GPU
> workloads from running!
> --------
> I filed an epic [1] to track this. The plan for the short-term is to
> introduce two mechanisms to mitigate these issues:
>     -Introduce a resource fairness exclusion list. This allows the shares
> of resources like "gpus" to be excluded from the dominant share.
>     -Introduce a GPU_AWARE framework capability. This indicates that the
> scheduler is aware of GPUs and will schedule tasks accordingly. Old
> schedulers will not have the capability and will not receive any offers for
> GPU machines. If a scheduler has the capability, we'll advise that they
> avoid placing their additional non-GPU workloads on the GPU machines.
> --------
> Longer term, we'll want a more robust way to manage scarce resources. The
> first thought we had was to have sub-pools of resources based on machine
> profile and perform fair sharing / quota within each pool. This addresses
> (1) cleanly, and for (2) the operator needs to explicitly disallow non-GPU
> frameworks from participating in the GPU pool.
> Unfortunately, by excluding non-GPU frameworks from the GPU pool we may
> have a lower level of utilization. In the even longer term, as we add
> revocation it will be possible to allow a scheduler desiring GPUs to revoke
> the resources allocated to the non-GPU workloads running on the GPU
> machines. There are a number of things we need to put in place to support
> revocation ([2], [3], [4], etc), so I'm glossing over the details here.
> If anyone has any thoughts or insight in this area, please share!
> Ben
> [1]
> [2]
> [3]
> [4]
