Meng Zhu created MESOS-9324:
-------------------------------
Summary: Resource fragmentation: frameworks may be starved of port
resources in the presence of large number frameworks with quota.
Key: MESOS-9324
URL: https://issues.apache.org/jira/browse/MESOS-9324
Project: Mesos
Issue Type: Bug
Components: allocation
Reporter: Meng Zhu
In our environment, where there are 1.5k frameworks and quota is heavily
utilized, we experience a severe resource fragmentation issue.
Specifically, we observed a large number of port-less offers circulating in the
cluster. Thus frameworks that need port resources are not able to launch tasks
even if their roles have quota (because currently, we can only set quota for
scalar resources, not port range resources).
Most of the 1.5k frameworks do not suppress today, and we believe the
situation will significantly improve once they do. Still, I think there are
some improvements the Mesos allocator can make to help.
*## How resources become fragmented*
These port-less offers originate from quota chopping. Specifically,
when chopping an agent to satisfy a role’s quota, we also hand out
resources that this role does not have quota for (as long as doing so does not
break other roles’ quota). These “extra resources” include ALL the
remaining port resources on the agent. After this offer, the agent is left
with no port resources even though it still has CPUs, etc. Later, these
leftover resources may be offered to other frameworks, but they are useless
without ports. Now we have some “bad offers” in the cluster.
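The chopping behavior described above can be illustrated with a minimal sketch. This is a toy model, not the allocator's actual code: the dict layout and the `chop_for_quota` helper are hypothetical, and only CPUs are treated as a quota'd resource.

```python
def chop_for_quota(agent, quota_cpus):
    """Toy model of quota chopping (hypothetical helper, not Mesos code).

    `agent` is a dict with 'cpus' (float) and 'ports' (a list of inclusive
    (begin, end) ranges). CPUs are chopped down to the role's quota, but
    ports, which carry no quota, are handed out in full.
    Returns (offer, remainder).
    """
    cpus = min(agent["cpus"], quota_cpus)
    offer = {"cpus": cpus, "ports": list(agent["ports"])}
    # Whatever is left on the agent has CPUs but no ports at all.
    remainder = {"cpus": agent["cpus"] - cpus, "ports": []}
    return offer, remainder


agent = {"cpus": 16.0, "ports": [(31000, 32000)]}
offer, remainder = chop_for_quota(agent, quota_cpus=4.0)
print("offer:    ", offer)
print("remainder:", remainder)
```

In this model the remainder still holds 12 CPUs but an empty port list, which is exactly the kind of "bad offer" that a port-requiring framework has to decline.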
*## How resource fragmentation is prolonged*
A resource offer, once declined (e.g. due to no ports), is recovered by
the allocator and offered to other frameworks again. Before this happens, the
offer may be able to merge with either the remaining resources or other
declined resources on the same agent. However, it is not uncommon for the
declined offer to be handed out again
*as-is*. This is especially likely if the allocator makes offers faster than
frameworks respond to them. As a result, we observe bad offers circulating
across different frameworks. These bad offers exist for a
long time before being consolidated again. For how long? *The longevity of the
bad offer will be roughly proportional to the number of active frameworks*. In
the worst case, only once all the active frameworks have declined the bad
offer (hopefully with a long decline filter) does it have nowhere to go and
finally start to merge with other resources on that agent.
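To make the proportionality claim concrete, here is a rough simulation under simplifying assumptions: a single bad offer, an allocator that re-offers it instantly to the first framework without an active decline filter, and a fixed per-framework response time. The function name and all parameters are hypothetical and do not correspond to allocator internals.

```python
def circulation_time(num_frameworks, response_time, refuse_seconds,
                     horizon=1e6):
    """Seconds a bad offer circulates before no framework is eligible.

    Toy model: each framework holds the offer for `response_time` seconds,
    declines it, and is then filtered out for `refuse_seconds`. The offer
    can only be consolidated once every framework is filtered. Returns
    `horizon` if the offer is still circulating at that point.
    """
    filter_expiry = [0.0] * num_frameworks  # when each decline filter ends
    now = 0.0
    while now < horizon:
        eligible = [i for i, t in enumerate(filter_expiry) if t <= now]
        if not eligible:
            return now  # nowhere to go: the offer can finally be merged
        fw = eligible[0]
        now += response_time  # framework sits on the offer, then declines
        filter_expiry[fw] = now + refuse_seconds
    return horizon


for n in (10, 100, 1500):
    print(n, circulation_time(n, response_time=1.0, refuse_seconds=3600.0))
```

With long decline filters, the time in circulation grows linearly with the number of frameworks in this model. With short filters (shorter than the time needed to visit every framework), the first framework becomes eligible again before the last one has declined, and the offer never settles within the horizon.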
Note that allocator performance has greatly improved in the past several
months, so the scenario described here could become increasingly common. Also,
as we introduce quota limits and hierarchical quota, there will be much more
agent chopping, making resource fragmentation even worse.
*## Near-term Mitigations*
As mentioned above, the longevity of a bad offer is proportional to the number
of active frameworks. Thus framework suppression will certainly help. In
addition, on the Mesos side, a few mitigation measures are worth considering
(other than the long-term optimistic allocation strategy):
1. Add a periodic defragmentation interval to the allocator. For example,
every minute or every dozen allocation cycles or so, we pause allocation,
rescind all the offers and start allocating again. This essentially eliminates
all the circulating bad offers by giving them a chance to be consolidated.
Think of this as a periodic “reboot” of the allocator.
2. Consider chopping non-quota resources as well. Right now, for resources such
as ports (or any other resources that the role does not have quota for), all
are allocated in a single offer. We could choose to chop these non-quota
resources as well. For example, port resources can be distributed
proportionally to allocated CPU resources.
3. Provide support for specifying port quantities. With this, we can utilize
the existing quota or `min_allocatable_resources` APIs to guarantee a certain
number of port resources.
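As an illustration of the second mitigation, the sketch below hands out ports in proportion to the share of the agent's available CPUs being offered. It is only one possible interpretation of "proportionally"; the helper names and the inclusive-range representation are assumptions, not existing Mesos APIs.

```python
def take_ports(ranges, count):
    """Split inclusive (begin, end) port ranges into (taken, left),
    taking the first `count` ports in order."""
    taken, left = [], []
    for begin, end in ranges:
        size = end - begin + 1
        if count >= size:
            taken.append((begin, end))
            count -= size
        elif count > 0:
            # Split this range: the lower part is taken, the rest stays.
            taken.append((begin, begin + count - 1))
            left.append((begin + count, end))
            count = 0
        else:
            left.append((begin, end))
    return taken, left


def chop_ports_proportionally(agent_ports, agent_cpus, offered_cpus):
    """Hypothetical policy: offer the same fraction of ports as of CPUs."""
    if agent_cpus <= 0:
        return [], list(agent_ports)
    total = sum(end - begin + 1 for begin, end in agent_ports)
    share = int(total * offered_cpus / agent_cpus)  # round down
    return take_ports(agent_ports, share)


taken, left = chop_ports_proportionally([(31000, 31999)], 16.0, 4.0)
print("offered ports:  ", taken)
print("remaining ports:", left)
```

Under this policy, an offer of 4 out of 16 CPUs carries a quarter of the agent's ports, so the remaining 12 CPUs still come with ports and stay usable by port-requiring frameworks.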