Hi Anindya, thanks for that nice systematic write-up. It makes it pretty clear that there are some inconsistencies in how back-off is handled, and how a more systematic approach could help.
I’d like to make a small remark here where I can use some more space than in the doc.

>> On Feb 12, 2017, at 9:03 PM, Anindya Sinha <anindya_si...@apple.com> wrote:
>>
>> Reference: https://issues.apache.org/jira/browse/MESOS-7087 <https://issues.apache.org/jira/browse/MESOS-7087>
>>
>> Currently, we have at least 3 types of backoff such as:
>> 1) Exponential backoff with randomness, as in framework/agent registration.
>> 2) Exponential backoff with no randomness, as in status updates.
>> 3) Linear backoff with randomness, as in executor registration.

We had a small water cooler discussion about this, and were wondering whether it would be worthwhile to also take into account the possibility of globally rate-limiting certain kinds of requests, e.g., framework/agent registration requests regardless of the source. This might lead to improvements for any kind of activity caused by state changes affecting a large number of agents or frameworks. I give a more technical example below.

Also, I believe that when evaluating improvements to back-off, it would be a good idea to examine the expected time difference between arrivals of messages from different actors as a function of the back-off procedure as a benchmark (either by checking the theoretical literature or by performing small Monte Carlo simulations).

Cheers,
Benjamin

* * *

# Technical example related to (1) above

Let’s say the following happens:

- A master failover occurs.
- All agents realize this pretty much simultaneously.
- All agents pretty much simultaneously start a registration procedure with the new master.

Now if no extra randomness were introduced into the back-off (but there is), the master would see registration attempts from all agents at pretty much the same time. In large clusters this could flood the master beyond its ability to handle these requests in a timely fashion.
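To make (1) concrete, here is a minimal sketch of exponential back-off with uniform jitter of the kind used for registration retries. The constants and the function name are made up for illustration; they are not the actual Mesos flags or code:

```python
import random

# Illustrative values only, not the real Mesos defaults.
INITIAL = 2.0      # base back-off in seconds
MAX_BACKOFF = 60.0 # cap on the back-off interval

def registration_delays(max_attempts=6):
    """Exponential back-off with uniform jitter: for attempt k, draw the
    delay uniformly from [0, min(INITIAL * 2**k, MAX_BACKOFF)], so two
    agents retrying after the same failure rarely retry at the same time."""
    delays = []
    for attempt in range(max_attempts):
        cap = min(INITIAL * 2 ** attempt, MAX_BACKOFF)
        delays.append(random.uniform(0, cap))
    return delays
```

Dropping the `random.uniform` call (i.e., always waiting the full `cap`) gives the deterministic variant discussed next, where all agents retry in lockstep.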
Deterministically spacing out registration attempts by larger and larger intervals wouldn’t help the master much when it has to deal with a massive simultaneous registration load. Effectively, the agents might still inadvertently be performing something like a coordinated DDoS attack on the master by all retrying after the same time. Technically, the underlying issue is that the expected time difference between arrival times of registration attempts from different agents at the master would still be a Dirac delta function (think: a pulse function with zero width sitting at zero).

Currently, the only tool protecting the master from having to handle a large number of registration attempts is the extra randomness we insert on the sender side. We pull this randomness from a uniform distribution. A uniform distribution is a great choice here since its tails are as fat as they can get. Fat tails lead to a wider arrival-time-difference distribution at the master (it is now a symmetric triangular distribution instead of a delta function, though still centered around zero). A wider arrival-time distribution means that the probability of registration attempts from different agents arriving close in time is lowered; this is great as it potentially gives the master more time to handle all the requests.

The remaining issue is that even though we have spaced out requests in time by introducing randomness at the source, the most likely time difference between arrivals of two messages would still be zero (that’s just a consequence of statistics: the distribution of the difference of two independent random numbers from the same distribution is symmetric and centered around zero). We have just shifted some probability from smaller to larger time differences, but for sufficiently large clusters a master might still need to handle many more messages than it realistically can.
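The triangular shape of the arrival-time-difference distribution is easy to check with exactly the kind of small Monte Carlo simulation suggested above. This sketch (arbitrary parameters) samples pairs of independent uniform jitters and histograms their difference; the counts peak in the central bins around zero and fall off linearly toward the edges:

```python
import random

def difference_histogram(n=100_000, width=10.0, bins=20):
    """Sample pairs of independent Uniform(0, width) jitters and
    histogram the arrival-time differences t1 - t2, which lie in
    (-width, width). The result is triangular with its mode at zero."""
    counts = [0] * bins
    for _ in range(n):
        d = random.uniform(0, width) - random.uniform(0, width)
        # Map d from (-width, width) onto a bin index in [0, bins).
        idx = min(int((d + width) / (2 * width) * bins), bins - 1)
        counts[idx] += 1
    return counts
```

Running this shows the point made above: jitter widens the distribution and thins out near-simultaneous arrivals, but zero remains the most likely difference.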
Note that we use randomness at the source to space out requests from each other (independent random numbers), and that there might be no entity which could coordinate the agents to collaboratively space out their requests more favorably for the master; e.g., during a master failover there would be no master to coordinate the agents’ behavior.

I believe one possible solution would be for the master to apply back pressure by rate-limiting messages *before it becomes overloaded* (e.g., decided by examining something like the process’ message queue size or the average time a message stays in the queue, and dropping requests before performing any real work on them). This would force clients into another back-off iteration, which would additionally space out requests.
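The drop-early idea could look roughly like the following sketch. This is purely illustrative: libprocess does not expose a class like this, and the queue-size threshold is an arbitrary stand-in for whatever overload signal (queue length, queueing delay) would actually be used:

```python
from collections import deque

class RateLimitingMaster:
    """Illustrative only: drop incoming registration attempts before doing
    any real work on them once the pending queue is too long, so the
    sender's back-off spaces out the retry."""

    def __init__(self, max_queue=1000):
        self.queue = deque()
        self.max_queue = max_queue
        self.dropped = 0

    def receive(self, message):
        # Shed load cheaply while still under pressure; the client will
        # retry after another (jittered) back-off interval.
        if len(self.queue) >= self.max_queue:
            self.dropped += 1
            return False
        self.queue.append(message)
        return True
```

The key property is that rejecting a message here costs almost nothing compared to processing a registration, so the master stays responsive while the clients' back-off does the actual spacing out.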