Thanks for that nice, systematic write-up. It makes it pretty clear that there
is some inconsistency in how back-off is handled, and how a more systematic
approach could help.
I’d like to make a small remark here, where I can use some more space than in
the ticket.
>> On Feb 12, 2017, at 9:03 PM, Anindya Sinha <anindya_si...@apple.com> wrote:
>> Reference: https://issues.apache.org/jira/browse/MESOS-7087
>> Currently, we have at least 3 types of backoff such as:
>> 1) Exponential backoff with randomness, as in framework/agent registration.
>> 2) Exponential backoff with no randomness, as in status updates.
>> 3) Linear backoff with randomness, as in executor registration.
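For concreteness, here is a minimal sketch of those three flavors. Function names, bases, and caps are hypothetical illustrations, not Mesos’s actual code:

```python
import random

def exponential_backoff_with_jitter(attempt, base=1.0, cap=60.0):
    """Flavor (1): exponential bound, delay drawn uniformly from [0, bound]."""
    bound = min(cap, base * 2 ** attempt)
    return random.uniform(0, bound)

def exponential_backoff(attempt, base=1.0, cap=60.0):
    """Flavor (2): deterministic exponential growth, no randomness."""
    return min(cap, base * 2 ** attempt)

def linear_backoff_with_jitter(attempt, step=1.0, cap=15.0):
    """Flavor (3): linearly growing bound with uniform jitter."""
    bound = min(cap, step * (attempt + 1))
    return random.uniform(0, bound)

for attempt in range(5):
    print(attempt,
          exponential_backoff_with_jitter(attempt),
          exponential_backoff(attempt),
          linear_backoff_with_jitter(attempt))
```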
We had a small water cooler discussion about this, and were wondering if it
would be worthwhile to also take the possibility of globally rate-limiting
certain request kinds into account, e.g., of framework/agent registration
requests regardless of the source. This might lead to improvements for any kind
of activity caused by state changes affecting a large number of agents or
frameworks. I give a more technical example below.
Also, when evaluating improvements to back-off, I believe it would be a good
idea to examine, as a benchmark, the expected time difference between arrivals
of messages from different actors as a function of the back-off procedure
(either by checking the theoretical literature or by performing small Monte
Carlo simulations).
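Such a Monte Carlo benchmark could be as small as the following sketch (the back-off procedure and all parameters here are hypothetical stand-ins): sample one retry round across many agents and look at the gaps between consecutive arrivals at the master.

```python
import random
import statistics

def retry_time(attempt, base=1.0, jitter=True):
    """Hypothetical back-off: exponential bound, optionally with uniform jitter."""
    bound = base * 2 ** attempt
    return random.uniform(0, bound) if jitter else bound

def interarrival_gaps(n_agents=1000, attempt=2, jitter=True):
    """Arrival times of one retry round, sorted; return consecutive gaps."""
    arrivals = sorted(retry_time(attempt, jitter=jitter) for _ in range(n_agents))
    return [b - a for a, b in zip(arrivals, arrivals[1:])]

gaps_jitter = interarrival_gaps(jitter=True)
gaps_plain = interarrival_gaps(jitter=False)
print("mean gap with jitter:   ", statistics.mean(gaps_jitter))
print("mean gap without jitter:", statistics.mean(gaps_plain))  # exactly 0
```

Without jitter, every agent retries after the same deterministic delay, so every gap is exactly zero; with jitter, the arrivals are spread over the whole back-off window.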
* * *
# Technical example related to (1) above
Let’s say the following happens:
- A master failover occurs.
- All agents realize this pretty much simultaneously.
- All agents pretty much simultaneously start a registration procedure with the
  master.
Now if there were no extra randomness introduced into the back-off (but there
is), the master would see registration attempts from all agents at pretty much
the same time. In large clusters this could flood the master beyond its
ability to handle these requests in a timely manner. That we deterministically
space out registration attempts by larger and larger intervals wouldn’t help
the master much when it has to deal with massive simultaneous registration
load. Effectively, the agents might still inadvertently be performing
something like a coordinated DDoS attack on the master by all retrying after
the same delay.
Technically, the underlying issue is that the distribution of time differences
between arrivals of registration attempts from different agents at the master
would still be a Dirac delta function (think: a pulse of zero width sitting at
zero).
Currently, the only tool protecting the master from having to handle a large
number of near-simultaneous registration attempts is the extra randomness we
insert on the sender side. We pull this randomness from a uniform
distribution. A uniform distribution is a great choice here, since its tails
are as fat as they can get. Fat tails lead to a wider arrival-time-difference
distribution at the master (it is now a symmetric triangular distribution
instead of a delta function, though still centered around zero). A wider
arrival-time-difference distribution means that the probability of
registration attempts from different agents arriving close together in time is
lowered; this is great, as it potentially gives the master more time to handle
all the requests.
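The triangular shape is easy to verify numerically. A quick sketch (not tied to any Mesos code): the difference of two independent uniform delays is symmetric around zero, and histogramming it shows the mode sitting at zero.

```python
import random

N = 100_000
BINS = 20  # over the range [-1, 1], so each bin is 0.1 wide

# Difference of two independent Uniform(0, 1) delays: triangular on [-1, 1].
diffs = [random.uniform(0, 1) - random.uniform(0, 1) for _ in range(N)]

counts = [0] * BINS
for d in diffs:
    idx = min(int((d + 1) / 2 * BINS), BINS - 1)
    counts[idx] += 1

peak = counts.index(max(counts))
print("peak bin:", peak)  # one of the two middle bins, i.e. difference ~ 0
```

The counts rise linearly toward the center and fall off symmetrically; the two middle bins (those straddling zero) carry the most mass, which is exactly the “most likely difference is still zero” point made below.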
The remaining issue is that even though we have spaced out requests in time by
introducing randomness at the source, the most likely time difference between
the arrivals of two messages is still zero (that’s just a consequence of
statistics: the distribution of the difference of two independent random
numbers drawn from the same distribution is symmetric and centered around
zero). We have just shifted some probability from smaller to larger time
differences, but for sufficiently large clusters a master might still need to
handle many more messages than it realistically can. Note that we use
randomness at the source to space out requests from each other (independent
random numbers), and that there might be no entity which could coordinate the
agents to collaboratively space out their requests more favorably for the
master; e.g., during a master failover there is no master to coordinate the
agents’ behavior.
I believe one possible solution would be for the master to apply back pressure
by rate-limiting messages *before it becomes overloaded* (e.g., as decided by
examining something like the process’ message queue size or the average time a
message stays in the queue, and dropping requests before performing any real
work on them). This would force clients into another back-off iteration, which
would additionally space out requests.
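As a rough sketch of that idea (everything here is hypothetical and not the libprocess API; the threshold and queue proxy are illustrative): the master checks a queue-depth proxy before doing any real work on a request, and sheds load early so the sender falls into its next, larger, randomized back-off window.

```python
from collections import deque

MAX_QUEUE = 100  # hypothetical threshold, chosen well below actual overload

class Master:
    def __init__(self):
        self.queue = deque()

    def receive(self, request):
        # Shed load *before* performing any real work: if the queue is
        # already deep, drop the request so the sender retries after its
        # next (randomized) back-off delay, spacing requests out further.
        if len(self.queue) >= MAX_QUEUE:
            return "dropped"
        self.queue.append(request)
        return "queued"

master = Master()
results = [master.receive(i) for i in range(150)]
print(results.count("queued"), results.count("dropped"))  # 100 50
```

The same check could instead use the average queue residence time, as mentioned above; the essential point is that the drop happens before the master spends any real work on the request.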