Anindya Sinha created MESOS-7087:
------------------------------------
Summary: Consider improving exponential backoff algorithm.
Key: MESOS-7087
URL: https://issues.apache.org/jira/browse/MESOS-7087
Project: Mesos
Issue Type: Improvement
Components: general
Reporter: Anindya Sinha
Assignee: Anindya Sinha
There are 3 types of backoff algorithms in use:
1) Exponential backoff with randomness, as in framework/agent registration.
2) Exponential backoff with no randomness, as in status updates.
3) Linear backoff with randomness, as in executor registration.
Consider framework registration. The nth retry attempt is made after a random
interval drawn from the range [0 .. backoff * 2^(n-1)], as long as the interval
is less than 1 min. The default value for backoff is 2 secs.
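
As a rough illustration of the scheme described above, here is a minimal Python
sketch. It is not the Mesos implementation: the function names are
hypothetical, and treating "less than 1 min" as a hard cap of 60 secs on the
upper bound is an assumption.

```python
import random


def retry_upper_bound(n, backoff=2.0, max_interval=60.0):
    """Upper end of the allowed range for the nth retry (n >= 1), in seconds.

    Assumption: the 1 min limit is applied as a cap on the upper bound.
    """
    return min(backoff * 2 ** (n - 1), max_interval)


def current_retry_delay(n, backoff=2.0, max_interval=60.0, rng=random):
    """Random delay in [0 .. upper bound] before the nth retry."""
    return rng.uniform(0.0, retry_upper_bound(n, backoff, max_interval))
```

Note that the lower end of the range is always 0, so any retry may fire almost
immediately regardless of n.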
Although the current approach combines exponential backoff with randomness, we
have observed that in clusters with thousands of agents and/or frameworks, the
randomized retry interval can end up being very short for a substantial number
of agents and/or frameworks, because the allowed range [0 .. <n>] starts at 0.
These frequent retries bombard the master with messages and overload it.
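
To make the effect concrete, the following sketch estimates how many clients
retry within a short window when each one draws its delay uniformly from
[0 .. upper]. The function name and the cluster size used in the example are
hypothetical and chosen only for illustration.

```python
def expected_early_retries(num_clients, window, upper):
    """Expected number of clients whose uniform [0 .. upper] delay falls
    within the first `window` seconds."""
    fraction = min(window / upper, 1.0)
    return num_clients * fraction
```

For example, with 10000 agents on their first retry (upper = 2 secs), roughly
500 of them would be expected to retry within the first 0.1 secs.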
So, to address the issues seen (especially with a large number of frameworks
and/or agents), the main requirements are:
1) Every subsequent retry should be spaced from the previous attempt by a
minimum deterministic amount.
2) Every subsequent retry interval should be greater than or equal to the
previous interval.
3) The maximum retry interval should be configurable, since it can be a
function of the initial backoff factor.
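
One possible way to satisfy these three points is sketched below in Python.
This is only an illustration under stated assumptions, not a proposed patch:
the function name, the choice of range [backoff * 2^(n-1) .. backoff * 2^n],
and the default cap of 60 secs are all hypothetical.

```python
import random


def improved_retry_delay(n, backoff=2.0, max_interval=60.0, rng=random):
    """Delay in seconds before the nth retry (n >= 1).

    The delay is drawn from [lower .. upper], where the lower bound is a
    deterministic minimum and both bounds are capped at max_interval.
    """
    lower = min(backoff * 2 ** (n - 1), max_interval)  # deterministic floor
    upper = min(backoff * 2 ** n, max_interval)        # configurable cap
    return rng.uniform(lower, upper)
```

Because the upper bound for retry n equals the lower bound for retry n+1, each
interval is at least as long as the previous one. A known trade-off of this
sketch is that once the cap is reached, every client waits exactly
max_interval, so the randomness disappears for later retries.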