[
https://issues.apache.org/jira/browse/MESOS-3165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Alexander Rukletsov updated MESOS-3165:
---------------------------------------
Description:
To persist quotas across failovers, the Master should save them in the
registry. To support this, we shall:
* Introduce a Quota state variable in registry.proto;
* Extend the Operation interface so that it supports a ‘Quota’ accumulator (see
src/master/registrar.hpp);
* Introduce AddQuota / RemoveQuota operations;
* Recover quotas from the registry on failover to the Master’s
internal::master::Role struct;
* Extend RegistrarTest with quota-specific tests.
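The operations listed above could look roughly like the following sketch. It is
a simplified model, not the actual registrar code: `QuotaEntry`, `Registry`,
`QuotaRoles`, `recoverRoles`, and the two-argument `perform` signature are
assumptions made for illustration, and a quota is reduced to a single CPU count.

```cpp
#include <algorithm>
#include <cassert>
#include <memory>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Hypothetical, simplified stand-ins; the real registry is a protobuf
// message defined in registry.proto.
struct QuotaEntry {
  std::string role;
  double cpus;  // Guaranteed CPUs; a real quota would carry full resources.
};

struct Registry {
  std::vector<QuotaEntry> quotas;
};

// Accumulator: the roles that currently have a quota, kept next to the
// registry so operations can check membership without scanning it.
using QuotaRoles = std::set<std::string>;

struct Operation {
  virtual ~Operation() = default;
  // Returns true if the registry was mutated and must be written back.
  virtual bool perform(Registry* registry, QuotaRoles* roles) = 0;
};

class AddQuota : public Operation {
public:
  AddQuota(std::string role, double cpus)
    : role_(std::move(role)), cpus_(cpus) {}

  bool perform(Registry* registry, QuotaRoles* roles) override {
    assert(registry != nullptr && roles != nullptr);
    if (!roles->insert(role_).second) {
      return false;  // Role already has a quota; registry stays untouched.
    }
    registry->quotas.push_back(QuotaEntry{role_, cpus_});
    return true;
  }

private:
  std::string role_;
  double cpus_;
};

class RemoveQuota : public Operation {
public:
  explicit RemoveQuota(std::string role) : role_(std::move(role)) {}

  bool perform(Registry* registry, QuotaRoles* roles) override {
    assert(registry != nullptr && roles != nullptr);
    if (roles->erase(role_) == 0) {
      return false;  // No quota recorded for this role.
    }
    auto& quotas = registry->quotas;
    quotas.erase(
        std::remove_if(
            quotas.begin(),
            quotas.end(),
            [this](const QuotaEntry& entry) { return entry.role == role_; }),
        quotas.end());
    return true;
  }

private:
  std::string role_;
};

// Rebuilds the accumulator from a recovered registry, as a newly elected
// Master would have to do after a failover.
QuotaRoles recoverRoles(const Registry& registry) {
  QuotaRoles roles;
  for (const QuotaEntry& entry : registry.quotas) {
    roles.insert(entry.role);
  }
  return roles;
}
```

Returning a "mutated" flag lets the caller skip a registry write when an
operation turns out to be a no-op, e.g. adding a quota for a role that already
has one.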
NOTE: The registry variable can be rather big for production clusters (see
MESOS-2075). While it should be fine for the MVP to add quota information to
the registry, we should consider storing quotas separately, as they do not need
to be kept in sync with slave updates. However, the registrar currently does
not support adding more variables.
While the Agents are reregistering (note they may fail to do so), the
information about what part of the quota is allocated is only partially
available to the Master. In other words, the state of the quota allocation is
reconstructed as Agents reregister. During this period, some roles may appear
to be under quota from the perspective of the newly elected Master.
The same problem exists on the allocator side: it may think the cluster is
under quota and may eagerly try to satisfy quotas before enough Agents
reregister, which may result in resources being allocated to frameworks beyond
their quota. To address this issue, and to avoid raising spurious under-quota
alerts, the Master should allow a certain amount of time for the majority
(e.g. 80%) of the Agents to reregister before reporting any quota status and
notifying the allocator about granted quotas.
was:
To persist quotas across failovers, the Master should save them in the
registry. To support this, we shall:
* Introduce a Quota state variable in registry.proto;
* Extend the Operation interface so that it supports a ‘Quota’ accumulator (see
src/master/registrar.hpp);
* Introduce AddQuota / RemoveQuota operations;
* Recover quotas from the registry on failover to the Master’s
internal::master::Role struct;
* Extend RegistrarTest with quota-specific tests.
NOTE: The registry variable can be rather big for production clusters (see
MESOS-2075). While it should be fine for the MVP to add quota information to
the registry, we should consider storing quotas separately, as they do not need
to be kept in sync with slave updates. However, the registrar currently does
not support adding more variables.
> Persist and recover quota to/from Registry
> ------------------------------------------
>
> Key: MESOS-3165
> URL: https://issues.apache.org/jira/browse/MESOS-3165
> Project: Mesos
> Issue Type: Task
> Components: master, replicated log
> Reporter: Alexander Rukletsov
> Assignee: Alexander Rukletsov
> Labels: mesosphere
>
> To persist quotas across failovers, the Master should save them in the
> registry. To support this, we shall:
> * Introduce a Quota state variable in registry.proto;
> * Extend the Operation interface so that it supports a ‘Quota’ accumulator
> (see src/master/registrar.hpp);
> * Introduce AddQuota / RemoveQuota operations;
> * Recover quotas from the registry on failover to the Master’s
> internal::master::Role struct;
> * Extend RegistrarTest with quota-specific tests.
> NOTE: The registry variable can be rather big for production clusters (see
> MESOS-2075). While it should be fine for the MVP to add quota information to
> the registry, we should consider storing quotas separately, as they do not
> need to be kept in sync with slave updates. However, the registrar currently
> does not support adding more variables.
> While the Agents are reregistering (note they may fail to do so), the
> information about what part of the quota is allocated is only partially
> available to the Master. In other words, the state of the quota allocation is
> reconstructed as Agents reregister. During this period, some roles may
> appear to be under quota from the perspective of the newly elected Master.
> The same problem exists on the allocator side: it may think the cluster is
> under quota and may eagerly try to satisfy quotas before enough Agents
> reregister, which may result in resources being allocated to frameworks
> beyond their quota. To address this issue, and to avoid raising spurious
> under-quota alerts, the Master should allow a certain amount of time for the
> majority (e.g. 80%) of the Agents to reregister before reporting any quota
> status and notifying the allocator about granted quotas.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)