I wanted to start a discussion about the allocation of "scarce" resources.
"Scarce" in this context means resources that are not present on every
machine. GPUs are the first example of a scarce resource that we support as
a known resource type.
Consider the behavior when there are the following agents in a cluster:
999 agents with (cpus:4,mem:1024,disk:1024)
1 agent with (gpus:1,cpus:4,mem:1024,disk:1024)
Here there are 1000 machines but only 1 has GPUs. We call GPUs a "scarce"
resource here because they are only present on a small percentage of the
machines.
We end up with some problematic behavior here with our current allocation
model:
(1) If a role wishes to use both GPU and non-GPU resources for tasks,
consuming 1 GPU will lead DRF to consider the role to have a 100% share of
the cluster, since it consumes 100% of the GPUs in the cluster. Frameworks
in this role will then not receive any other offers.
(2) Because we do not have revocation yet, if a framework decides to
consume the non-GPU resources on a GPU machine, it will prevent the GPU
workloads from running!
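To make (1) concrete, here is a minimal sketch of a dominant share calculation over the example cluster. It is not the allocator's actual code; the function name and the two example roles are made up for illustration.

```python
from fractions import Fraction

# Cluster totals from the example: 1000 agents, exactly one of which has a GPU.
TOTAL = {"cpus": 4 * 1000, "mem": 1024 * 1000, "disk": 1024 * 1000, "gpus": 1}


def dominant_share(allocated, total):
    """Return the largest fraction of any single resource held by a role."""
    shares = [
        Fraction(amount, total[name])
        for name, amount in allocated.items()
        if total.get(name)
    ]
    return max(shares, default=Fraction(0))


# Hypothetical role running one small GPU task.
gpu_role = {"gpus": 1, "cpus": 1, "mem": 128}
# Hypothetical role holding half of the cluster's CPUs.
cpu_role = {"cpus": 2000, "mem": 256000}

print(dominant_share(gpu_role, TOTAL))  # 1
print(dominant_share(cpu_role, TOTAL))  # 1/2
```

The role holding a single GPU ranks as if it owned the whole cluster, behind a role that holds half of all CPUs.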
I filed an epic to track this. The plan for the short-term is to
introduce two mechanisms to mitigate these issues:
- Introduce a resource fairness exclusion list. This allows the shares
  of resources like "gpus" to be excluded from the dominant share.
- Introduce a GPU_AWARE framework capability. This indicates that the
  scheduler is aware of GPUs and will schedule tasks accordingly. Old
  schedulers will not have the capability and will not receive any offers for
  GPU machines. If a scheduler has the capability, we'll advise that it
  avoid placing its additional non-GPU workloads on the GPU machines.
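A rough sketch of how the two mechanisms could behave, assuming the simplest possible semantics. The `excluded` parameter and the `may_offer` helper are hypothetical names for illustration, not existing allocator flags or APIs; only the GPU_AWARE capability name comes from the proposal above.

```python
from fractions import Fraction

# Cluster totals for 999 non-GPU agents plus 1 GPU agent.
TOTAL = {"cpus": 4000, "mem": 1024000, "disk": 1024000, "gpus": 1}


def dominant_share(allocated, total, excluded=frozenset()):
    """Dominant share that ignores resources on the fairness exclusion list."""
    shares = [
        Fraction(amount, total[name])
        for name, amount in allocated.items()
        if name not in excluded and total.get(name)
    ]
    return max(shares, default=Fraction(0))


def may_offer(agent_resources, framework_capabilities):
    """Hypothetical filter: GPU agents are offered only to GPU_AWARE frameworks."""
    has_gpus = agent_resources.get("gpus", 0) > 0
    return not has_gpus or "GPU_AWARE" in framework_capabilities


gpu_role = {"gpus": 1, "cpus": 1, "mem": 128}
print(dominant_share(gpu_role, TOTAL))                     # 1
print(dominant_share(gpu_role, TOTAL, excluded={"gpus"}))  # 1/4000

gpu_agent = {"gpus": 1, "cpus": 4, "mem": 1024, "disk": 1024}
print(may_offer(gpu_agent, set()))          # False
print(may_offer(gpu_agent, {"GPU_AWARE"}))  # True
```

With "gpus" excluded, the role's share is driven by its CPU usage, so it keeps receiving offers for non-GPU resources.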
Longer term, we'll want a more robust way to manage scarce resources. The
first thought we had was to have sub-pools of resources based on machine
profile and perform fair sharing / quota within each pool. This addresses
(1) cleanly, and for (2) the operator needs to explicitly disallow non-GPU
frameworks from participating in the GPU pool.
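As a sketch of the sub-pool idea, assume a machine profile is just the set of resource names an agent has (a real profile definition would likely be operator-defined). All names below are hypothetical. Shares are then computed per pool rather than cluster-wide.

```python
from collections import defaultdict
from fractions import Fraction


def profile(agent):
    """Hypothetical machine profile: sorted names of the resources present."""
    return tuple(sorted(name for name, amount in agent.items() if amount > 0))


def pool_totals(agents):
    """Sum the resources of all agents, grouped by machine profile."""
    totals = defaultdict(lambda: defaultdict(int))
    for agent in agents:
        pool = totals[profile(agent)]
        for name, amount in agent.items():
            pool[name] += amount
    return totals


def dominant_share(allocated, total):
    """Largest fraction of any single resource of one pool held by a role."""
    shares = [
        Fraction(amount, total[name])
        for name, amount in allocated.items()
        if total.get(name)
    ]
    return max(shares, default=Fraction(0))


agents = [{"cpus": 4, "mem": 1024, "disk": 1024}] * 999
agents.append({"gpus": 1, "cpus": 4, "mem": 1024, "disk": 1024})

totals = pool_totals(agents)
gpu_pool = totals[("cpus", "disk", "gpus", "mem")]
cpu_pool = totals[("cpus", "disk", "mem")]

# A role's allocation is tracked separately in each pool.
in_gpu_pool = {"gpus": 1, "cpus": 1}
in_cpu_pool = {"cpus": 8}
print(dominant_share(in_gpu_pool, gpu_pool))  # 1
print(dominant_share(in_cpu_pool, cpu_pool))  # 2/999
```

Holding the only GPU saturates the role's share in the GPU pool alone, so its standing in the non-GPU pool is unaffected.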
Unfortunately, by excluding non-GPU frameworks from the GPU pool we may
have a lower level of utilization. In the even longer term, as we add
revocation it will be possible to allow a scheduler desiring GPUs to revoke
the resources allocated to the non-GPU workloads running on the GPU
machines. There are a number of things we need to put in place to support
revocation, so I'm glossing over the details here.
If anyone has any thoughts or insight in this area, please share!