[ 
https://issues.apache.org/jira/browse/MESOS-3165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-3165:
---------------------------------------
    Description: 
To persist quotas across failovers, the Master should save them in the 
registry. To support this, we shall:
* Introduce a Quota state variable in registry.proto;
* Extend the Operation interface so that it supports a ‘Quota’ accumulator (see 
src/master/registrar.hpp);
* Introduce AddQuota / RemoveQuota operations;
* Recover quotas from the registry on failover to the Master’s 
internal::master::Role struct;
* Extend RegistrarTest with quota-specific tests.
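
The steps above can be sketched roughly as follows. This is a minimal,
self-contained illustration and not the actual registrar code: Registry,
Operation, perform(), Role and recoverQuotas() are simplified, hypothetical
stand-ins (the real registry is a protobuf message, and the real Operation
interface in src/master/registrar.hpp differs), and a quota is reduced to a
single number per role.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>

// Hypothetical stand-in for the registry state. The real registry is a
// protobuf message defined in registry.proto.
struct Registry {
  std::map<std::string, double> quotas;  // role -> guaranteed CPUs (assumed)
};

// Hypothetical, simplified operation interface: perform() returns true if
// the registry was mutated and therefore needs to be persisted.
struct Operation {
  virtual ~Operation() = default;
  virtual bool perform(Registry* registry) = 0;
};

struct AddQuota : Operation {
  AddQuota(std::string role, double cpus)
      : role_(std::move(role)), cpus_(cpus) {}

  bool perform(Registry* registry) override {
    // Do not overwrite an existing quota; report "no mutation" instead.
    return registry->quotas.emplace(role_, cpus_).second;
  }

  std::string role_;
  double cpus_;
};

struct RemoveQuota : Operation {
  explicit RemoveQuota(std::string role) : role_(std::move(role)) {}

  bool perform(Registry* registry) override {
    // erase() returns the number of removed entries (0 or 1 for a map).
    return registry->quotas.erase(role_) > 0;
  }

  std::string role_;
};

// Hypothetical in-memory role state kept by the master.
struct Role {
  double quotaCpus = 0.0;
};

// On failover, rebuild the master's per-role state from the registry.
std::map<std::string, Role> recoverQuotas(const Registry& registry) {
  std::map<std::string, Role> roles;
  for (const auto& entry : registry.quotas) {
    Role role;
    role.quotaCpus = entry.second;
    roles[entry.first] = role;
  }
  return roles;
}
```

In this sketch, a RegistrarTest-style check would apply AddQuota, verify that
recoverQuotas() returns the role, then apply RemoveQuota and verify that the
role is gone.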

NOTE: The registry variable can be rather large for production clusters (see 
MESOS-2075). While adding quota information to the registry should be fine for 
the MVP, we should consider storing Quota separately, since it does not need to 
be kept in sync with slave updates. However, the registrar currently does not 
support adding more variables.

While the Agents are reregistering (note they may fail to do so), the 
information about what part of the quota is allocated is only partially 
available to the Master. In other words, the state of the quota allocation is 
reconstructed as Agents reregister. During this period, some roles may appear 
to be under quota from the perspective of the newly elected Master.
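
To illustrate why the Master's view is only partial during this period, here
is a small sketch. QuotaView and its methods are hypothetical names, and both
quotas and allocations are simplified to one number per role. The allocated
amount for a role grows only as Agents reregister and report their
allocations, so a role can look under quota until enough Agents have come
back.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>

// Hypothetical tracker of the Master's view after a failover.
class QuotaView {
 public:
  // `guarantees` maps role -> quota, as recovered from the registry.
  explicit QuotaView(std::map<std::string, double> guarantees)
      : guarantees_(std::move(guarantees)) {}

  // Called when an Agent reregisters and reports how much it has
  // allocated to each role.
  void agentReregistered(const std::map<std::string, double>& allocatedByRole) {
    for (const auto& entry : allocatedByRole) {
      allocated_[entry.first] += entry.second;
    }
  }

  // True if, based on the Agents seen so far, the role's allocation is
  // below its quota. Roles without a quota are never under quota.
  bool appearsUnderQuota(const std::string& role) const {
    auto guarantee = guarantees_.find(role);
    if (guarantee == guarantees_.end()) {
      return false;
    }
    auto seen = allocated_.find(role);
    double allocated = (seen == allocated_.end()) ? 0.0 : seen->second;
    return allocated < guarantee->second;
  }

 private:
  std::map<std::string, double> guarantees_;
  std::map<std::string, double> allocated_;
};
```

With a quota of 8 for a role and two Agents that each hold 4, the role appears
under quota until the second Agent has reregistered, even though the quota was
fully satisfied before the failover.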

The same problem exists on the allocator side: it may think the cluster is 
under quota and may eagerly try to satisfy quotas before enough Agents 
reregister, which may result in resources being allocated to frameworks beyond 
their quota. To address this issue, and to avoid panicking and generating 
under-quota alerts, the Master should allow a certain amount of time for the 
majority (e.g. 80%) of the Agents to reregister before reporting any quota 
status and notifying the allocator about granted quotas.
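
One possible reading of this grace period is sketched below. The
ReregistrationGate type, its fields, and the decision to open the gate once
the timeout expires regardless of how many Agents have returned are
assumptions made for illustration, not a settled design.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>

// Hypothetical policy object: decides when the Master may start reporting
// quota status and notifying the allocator about granted quotas.
struct ReregistrationGate {
  std::size_t expectedAgents;    // Agents known from the registry
  double requiredFraction;       // e.g. 0.8 for 80%
  std::chrono::seconds timeout;  // assumed upper bound on the waiting time

  bool open(std::size_t reregistered, std::chrono::seconds elapsed) const {
    // Stop waiting once the grace period has expired (assumption).
    if (elapsed >= timeout) {
      return true;
    }
    // Nothing to wait for if no Agents were registered before failover.
    if (expectedAgents == 0) {
      return true;
    }
    // Otherwise require the configured fraction of Agents to be back.
    return static_cast<double>(reregistered) >=
           requiredFraction * static_cast<double>(expectedAgents);
  }
};
```

For example, with 10 expected Agents, a fraction of 0.8 and a 10 minute
timeout, the gate stays closed while only 7 Agents are back and opens once 9
have reregistered or the timeout expires.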

  was:
To persist quotas across failovers, the Master should save them in the 
registry. To support this, we shall:
* Introduce a Quota state variable in registry.proto;
* Extend the Operation interface so that it supports a ‘Quota’ accumulator (see 
src/master/registrar.hpp);
* Introduce AddQuota / RemoveQuota operations;
* Recover quotas from the registry on failover to the Master’s 
internal::master::Role struct;
* Extend RegistrarTest with quota-specific tests.

NOTE: The registry variable can be rather large for production clusters (see 
MESOS-2075). While adding quota information to the registry should be fine for 
the MVP, we should consider storing Quota separately, since it does not need to 
be kept in sync with slave updates. However, the registrar currently does not 
support adding more variables.


> Persist and recover quota to/from Registry
> ------------------------------------------
>
>                 Key: MESOS-3165
>                 URL: https://issues.apache.org/jira/browse/MESOS-3165
>             Project: Mesos
>          Issue Type: Task
>          Components: master, replicated log
>            Reporter: Alexander Rukletsov
>            Assignee: Alexander Rukletsov
>              Labels: mesosphere



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
