rawlinp commented on PR #7007: URL: https://github.com/apache/trafficcontrol/pull/7007#issuecomment-1206860282
> t3c runs should already be dispersed. The ATC cache-config itself doesn't do that, users need to make their script do that, but I think the Ansible in the repo has an example, and it's not difficult.

I know that the runs themselves are already dispersed, but I'm saying that in a situation where retries are necessary (e.g. TODB is overloaded), the existing dispersion is not necessarily going to be enough. Adding random jitter is like expanding the dispersion window dynamically, as opposed to having to edit the cron job frequency in the middle of an outage scenario.

For example, let's say the dispersion window is large enough that only 100 out of 3000 total clients are making requests to TO at any given point in time, but they all encounter an error and need to retry. Since they all started at the same time, they're all going to retry at the same time (even though it is only 100 out of 3000), which is not ideal. If TODB couldn't handle the load of 100 concurrent requests for some reason, then it still wouldn't be able to handle these 100 concurrent retries.

However, by giving the clients random jitter within that sleep interval, as long as the maximum sleep interval is longer than the normal cron interval, we'd actually be making the requests _more_ dispersed in an outage situation, without having to change cron frequencies, which is what we want. That's why I think making the "default" retries include up to a final 4-8 minute sleep, along with random jitter, can really help us out.

-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at: [email protected]
