rawlinp commented on PR #7007:
URL: https://github.com/apache/trafficcontrol/pull/7007#issuecomment-1206860282

   > t3c runs should already be dispersed. The ATC cache-config itself doesn't 
do that, users need to make their script do that, but I think the Ansible in 
the repo has an example, and it's not difficult.
   
   I know that the runs themselves are already dispersed, but I'm saying that 
in the situation where retries are necessary (e.g. TODB is overloaded), the 
existing dispersion is not necessarily going to be enough. By adding random 
jitter, it's like expanding the dispersion window dynamically, as opposed to 
having to edit the cron job frequency in the middle of an outage scenario.
   
   For example, let's say the dispersion window is large enough that only 100 
out of 3000 clients are making requests to TO at any given point in time, 
but they all encounter an error and need to retry. Since they all started at 
the same time and sleep for the same fixed interval, they're all going to 
retry at the same time (even though it is only 100 out of 3000), which is not 
ideal. If TODB couldn't handle the load of 100 concurrent requests for some 
reason, then it still wouldn't be able to handle these 100 concurrent retries.
   
   However, by giving the clients random jitter within that sleep interval, as 
long as the maximum sleep interval is longer than the normal cron interval, 
we'd actually be making the requests _more_ dispersed in an outage situation, 
without having to change cron frequencies, which is what we want. That's why I 
think making the "default" retries include up to a final 4-8 minute sleep, 
along with random jitter, can really help us out.


