Reference: https://issues.apache.org/jira/browse/MESOS-7087 

Currently, we have at least 3 types of backoff:
1) Exponential backoff with randomness, as in framework/agent registration.
2) Exponential backoff with no randomness, as in status updates.
3) Linear backoff with randomness, as in executor registration.

In framework registration, for example, the nth retry attempt waits for a
random interval in the range [0 .. b*2^(n-1)], as long as that interval is
less than 1 min.
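
As a rough sketch of the range above (not the actual Mesos code): the names
retry_upper_bound and retry_delay are made up for illustration, the draw is
assumed to be uniform, and the 1 min condition is assumed to act as a cap on
the upper bound.

```python
import random

# Assumption: the "less than 1 min" condition is modelled as a cap.
MAX_INTERVAL_SECS = 60.0


def retry_upper_bound(b, n, cap=MAX_INTERVAL_SECS):
    """Upper end of the allowed range for the nth retry (n starts at 1)."""
    return min(b * 2 ** (n - 1), cap)


def retry_delay(b, n, rng=random):
    """Pick a delay from [0, upper bound] for the nth retry.

    Assumption: the randomness is a uniform draw over the whole range.
    """
    return rng.uniform(0.0, retry_upper_bound(b, n))
```

Note that the lower end of the range is 0 for every n, so even a late retry
can draw a delay close to 0.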

For clusters with a large number of frameworks and/or agents, the randomness
may not be enough: since the allowed range is [0 .. <n>] for all retry
attempts, i.e. it always starts at 0, the timeout can end up being very small
for a substantial number of clients (agents and/or frameworks).
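
To put a rough number on this, here is a small sketch. It assumes each client
draws its timeout uniformly from [0, upper bound]; the function name and the
figures in the usage note are hypothetical and not measurements from a real
cluster.

```python
def expected_clients_below(num_clients, threshold_secs, upper_bound_secs):
    """Expected number of clients whose timeout is below threshold_secs.

    Assumption: every client draws independently and uniformly from
    [0, upper_bound_secs], so the fraction below the threshold is
    threshold_secs / upper_bound_secs (at most 1).
    """
    fraction = min(threshold_secs / upper_bound_secs, 1.0)
    return num_clients * fraction
```

For example, expected_clients_below(10000, 0.5, 2.0) gives 2500.0: with
10,000 clients and a 2 sec upper bound, about a quarter of them would retry
within the first half second.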

The following doc looks at an enhancement to the existing proposal to ensure
that the timeout values are not extremely small, and that every subsequent
retry has a timeout value at least as large as that of the previous attempt.

https://docs.google.com/document/d/1nUxvh6BbB8jv5G-MvckGj9XzFYLBrUM0O5Go_Zmdftk/edit?usp=sharing
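
To make the two goals concrete, here is one possible way to get both
properties. This is only an illustration and not necessarily the scheme
described in the doc; next_delay, retry_schedule, min_delay and cap are
made-up names and values.

```python
import random


def next_delay(prev_delay, b, n, min_delay=0.5, cap=60.0, rng=random):
    """Delay for the nth retry (n starts at 1), never below prev_delay.

    Assumptions: uniform draw, a hypothetical floor min_delay, and a cap
    on the exponential upper bound b*2^(n-1).
    """
    upper = min(b * 2 ** (n - 1), cap)
    # Start the range at the previous delay (or the floor), not at 0.
    lower = min(max(prev_delay, min_delay), upper)
    return rng.uniform(lower, upper)


def retry_schedule(b, attempts, min_delay=0.5, cap=60.0, rng=random):
    """Delays for retries 1..attempts, each at least the previous one."""
    delays = []
    prev = 0.0
    for n in range(1, attempts + 1):
        prev = next_delay(prev, b, n, min_delay, cap, rng)
        delays.append(prev)
    return delays
```

Calling retry_schedule(1.0, 10) returns ten delays that never decrease and
stay between min_delay and cap. If b is smaller than min_delay, the first few
delays are pinned to the upper bound instead of the floor.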
 
Feedback welcome.

Thanks
Anindya
