Hi Anindya,

thanks for that nice, systematic write-up. It makes it pretty clear that there 
are some inconsistencies in how back-off is handled, and how a more systematic 
approach could help.

I’d like to make a small remark here where I can use some more space than in 
the doc.

>> On Feb 12, 2017, at 9:03 PM, Anindya Sinha <anindya_si...@apple.com> wrote:
>> Reference: https://issues.apache.org/jira/browse/MESOS-7087 
>> <https://issues.apache.org/jira/browse/MESOS-7087>
>> Currently, we have at least 3 types of backoff such as:
>> 1) Exponential backoff with randomness, as in framework/agent registration.
>> 2) Exponential backoff with no randomness, as in status updates.
>> 3) Linear backoff with randomness, as in executor registration.
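
For concreteness, the three flavors above might look roughly like the sketch 
below. The function names, base/step/cap parameters, and the full-jitter choice 
are all illustrative assumptions of mine, not the actual Mesos implementation.

```python
import random

def exponential_with_jitter(attempt, base=2.0, cap=60.0):
    # (1) Exponential back-off with randomness, as in
    # framework/agent registration: double the window each
    # attempt, then pick a random delay within it.
    window = min(cap, base * 2 ** attempt)
    return random.uniform(0, window)

def exponential(attempt, base=2.0, cap=60.0):
    # (2) Exponential back-off with no randomness, as in
    # status updates: a deterministic, doubling delay.
    return min(cap, base * 2 ** attempt)

def linear_with_jitter(attempt, step=2.0, cap=60.0):
    # (3) Linear back-off with randomness, as in executor
    # registration: the window grows by a fixed step.
    window = min(cap, step * (attempt + 1))
    return random.uniform(0, window)
```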

We had a small water cooler discussion about this, and were wondering if it 
would be worthwhile to also take the possibility of globally rate-limiting 
certain request kinds into account, e.g., of framework/agent registration 
requests regardless of the source. This might lead to improvements for any kind 
of activity caused by state changes affecting a large number of agents or 
frameworks. I give a more technical example below.

Also, I believe that when evaluating improvements to back-off, it would be a 
good idea to examine the expected time difference between arrivals of messages 
from different actors, as a function of the back-off procedure, as a benchmark 
(either by checking the theoretical literature or by performing small Monte 
Carlo simulations).

* * * 

# Technical example related to (1) above

Let’s say the following happens:

- A master failover occurs.
- All agents realize this pretty much simultaneously.
- All agents pretty much simultaneously start a registration procedure with the 
new master.

Now if there were no extra randomness introduced into the back-off (but there 
is), the master would see registration attempts from all agents at pretty much 
the same time. In large clusters this could flood the master beyond its ability 
to handle these requests in a timely manner. Deterministically spacing out 
registration attempts by larger and larger intervals wouldn’t help the master 
much, since it would still have to deal with massive simultaneous registration 
load. Effectively, the agents might still inadvertently be performing something 
like a coordinated DDoS attack on the master by all retrying after the same 
delay. Technically, the underlying issue is that the expected time difference 
between arrival times of registration attempts from different agents at the 
master would still be a Dirac delta function (think: a pulse of zero width 
sitting at zero).

Currently, the only tool protecting the master from having to handle a large 
number of simultaneous registration attempts is the extra randomness we insert 
on the sender side. We pull this randomness from a uniform distribution. A 
uniform distribution is a great choice here since its tails are as fat as they 
can get. Fat tails lead to a wider arrival time difference distribution at the 
master (it is now a symmetric triangular distribution instead of a delta 
function, though still centered around zero). A wider arrival time difference 
distribution means that the probability of registration attempts from different 
agents arriving close in time is lowered; this is great as it potentially gives 
the master more time to handle all the requests.
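
This is exactly the kind of question a small Monte Carlo simulation can answer. 
The sketch below (with made-up retry time and jitter window, purely for 
illustration) samples the arrival time difference between two agents that both 
retry at the same deterministic time plus uniform jitter:

```python
import random

def arrival_diff_samples(jitter, n=100_000, retry_at=10.0):
    # Sample the arrival time difference between two agents
    # that both retry at `retry_at` seconds, each adding an
    # independent uniform jitter drawn from [0, jitter].
    samples = []
    for _ in range(n):
        a = retry_at + random.uniform(0, jitter)
        b = retry_at + random.uniform(0, jitter)
        samples.append(a - b)
    return samples

# Without jitter every difference is exactly zero (the delta
# function). With jitter the differences follow a symmetric
# triangular distribution on [-jitter, +jitter], still peaked
# at zero: a lot of mass stays near zero, little is far out.
diffs = arrival_diff_samples(jitter=5.0)
mean = sum(diffs) / len(diffs)
near_zero = sum(abs(d) < 0.5 for d in diffs) / len(diffs)
far_out = sum(abs(d) > 4.5 for d in diffs) / len(diffs)
```

Running this shows the mean near zero and far more mass within half a second of 
zero than out in the tails, which is the “still centered around zero” point 
made above.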

The remaining issue is that even though we have spaced out requests in time by 
introducing randomness at the source, the most likely time difference between 
arrivals of two messages would still be zero (that’s just a consequence of 
statistics: the distribution of the difference of two independent random 
numbers drawn from the same distribution is symmetric and centered around 
zero). We have just shifted some probability from smaller to larger time 
differences, but for sufficiently large clusters a master might still need to 
handle many more messages than it realistically can. Note that we use 
randomness at the source to space out requests from each other (independent 
random numbers), and that there might be no entity which could coordinate the 
agents to collaboratively space out their requests more favorably for the 
master; e.g., during a master failover there would be no master to coordinate 
the agents’ behavior.

I believe one possible solution for this would be to apply back pressure by 
having the master rate-limit messages *before it becomes overloaded* (e.g., 
decided by examining something like the process’ message queue size or the 
average time a message stays in the queue, and dropping requests before 
performing any real work on them). This would force clients into another 
back-off iteration, which would additionally space out requests.
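
As a rough illustration of the queue-size variant of this idea (class and 
parameter names are hypothetical, and a real implementation would live in the 
master’s message handling path, not a standalone class):

```python
from collections import deque

class BackPressureQueue:
    # Shed load early: once the queue length crosses a high
    # watermark, reject new messages before doing any real work
    # on them, forcing senders into another back-off iteration.
    def __init__(self, high_watermark=1000):
        self.queue = deque()
        self.high_watermark = high_watermark

    def offer(self, message):
        # Returns True if the message was accepted, False if it
        # was dropped; the sender should then back off and retry.
        if len(self.queue) >= self.high_watermark:
            return False
        self.queue.append(message)
        return True

    def take(self):
        # Dequeue the next message for actual processing.
        return self.queue.popleft() if self.queue else None
```

The average-queueing-time variant mentioned above would work the same way, just 
with a different overload signal deciding when `offer` starts rejecting.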
