+1. I support the goal of making Traffic Control more reliable and think this is a step in the right direction.
In some of our other products, we've had good luck using a circuit breaker as part of the retry process (https://pypi.org/project/circuitbreaker/, linked just for the algorithm; I'm not suggesting we use it). Exponential retries are useful, but they must be capped at a certain interval or number of retries to provide some guarantee about response time to the user. The state of the retries must also be shared across all callers of the API within a given execution context. If we have a program with many threads that all use the TO API, the failure of API calls in one thread should result in a back-off or blockage of calls from the other threads; otherwise, we may continue overloading the API with requests. Pulling in a library like this is the first step; employing it properly is the larger challenge.

—Eric

On Oct 24, 2018, at 7:43 PM, Dave Neuman <[email protected]> wrote:

+1, seems reasonable.

On Wed, Oct 24, 2018 at 15:03 Rawlin Peters <[email protected]> wrote:

Hey Traffic Controllers,

As a large distributed system, Traffic Control has a lot of components that make requests to each other over the network, and these requests are known to fail sometimes and require periodic retrying. Some of these requests can also be very expensive to make and deserve a more careful retry strategy. For that reason, I'd like to propose that we make it easier for our Go components to do exponential backoff retrying with randomness by introducing a simple Go library. It would require minimal changes to existing retry logic and would be simple to use in new code that needs retrying.

I've noticed a handful of places in different Traffic Control components where there is retrying of operations that may potentially fail (e.g. making requests to the TO API). Most of these places simply retry the failed operation after sleeping for X seconds, where X is hard-coded or configurable with a sane default. In cases where an operation fails because the target service is overloaded (or the service cannot come back up because requests keep piling up), I think we could make TC more robust by introducing exponential backoff into our retry handlers. When a service is overloaded and responding slowly as a result, simply retrying the same request every X seconds is not ideal and could prevent the overloaded service from becoming healthy again. It is generally a best practice to back off exponentially between retries, and it is even better with added randomness (jitter) so that retries from different clients don't all fire at the same time.

For example, this is what a lot of retries look like today:

attempt #1 failed
* sleep 1s
attempt #2 failed
* sleep 1s
attempt #3 failed
* sleep 1s
... and so on

This is what it would look like with exponential (factor of 2) backoff and some randomness/jitter:

attempt #1 failed
* sleep 0.8s
attempt #2 failed
* sleep 2.1s
attempt #3 failed
* sleep 4.2s
... and so on until some max interval like 60s ...
attempt #9 failed
* sleep 61.3s
attempt #10 failed
* sleep 53.1s
... and so on

I think this would greatly improve our project's story around retrying, and I've found a couple of Go libraries that seem pretty good.
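To make the jittered schedule above concrete, here is a rough, hand-rolled sketch of the calculation. The factor of 2 and the ~60s cap come from the example; the specific jitter scheme (a random point between the minimum and the computed delay) is just one common choice, and neither library discussed below is required for this:

```go
package main

import (
	"fmt"
	"math"
	"math/rand"
	"time"
)

// backoffDuration returns how long to sleep before retry number `attempt`
// (starting at 0): min * factor^attempt, capped at max, then randomized
// (jitter) so that many clients retrying at once don't all wake up together.
func backoffDuration(attempt int, min, max time.Duration, factor float64) time.Duration {
	d := float64(min) * math.Pow(factor, float64(attempt))
	if d > float64(max) {
		d = float64(max)
	}
	// One common jitter scheme: pick a random duration between min and d.
	return min + time.Duration(rand.Float64()*(d-float64(min)))
}

func main() {
	// Prints a schedule roughly like the example: ~1s, ~2s, ~4s, ... capped near 60s.
	for attempt := 0; attempt < 10; attempt++ {
		fmt.Printf("attempt #%d failed -> sleep %s\n",
			attempt+1, backoffDuration(attempt, time.Second, 60*time.Second, 2))
	}
}
```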
Both libraries are MIT-licensed:

* https://github.com/cenkalti/backoff - provides the backoff algorithm but also a handful of other useful features, like Retry wrappers that handle the looping for you
* https://github.com/jpillora/backoff - a bit simpler than the other; provides just the basic backoff calculator/counter, which gives you a Duration that you'd use in `time.Sleep()`, leaving you responsible for actually looping over the retriable operation

I think I'd prefer jpillora/backoff for its simplicity and conciseness. You basically just build a struct with your backoff options (min, max, factor, jitter) and it spits out Durations for you to sleep for before retrying. It doesn't expect much else from you.

Please drop a +1/-1 if you're for or against adding exponential backoff retrying and vendoring a project like one of the above in Traffic Control.

Thanks,
Rawlin
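For reference, a minimal sketch of what a retry loop built on jpillora/backoff could look like. The option values, the attempt cap, and the fetchTOData placeholder are illustrative only, not part of the proposal:

```go
package main

import (
	"errors"
	"fmt"
	"time"

	"github.com/jpillora/backoff"
)

// fetchTOData is a hypothetical stand-in for any retriable operation,
// e.g. a request to the TO API.
func fetchTOData() error {
	return errors.New("not implemented")
}

func main() {
	b := &backoff.Backoff{
		Min:    1 * time.Second,  // first sleep is roughly this long
		Max:    60 * time.Second, // sleeps are capped around this value
		Factor: 2,                // double the sleep on each failed attempt
		Jitter: true,             // randomize each sleep to avoid thundering herds
	}

	const maxAttempts = 10
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		err := fetchTOData()
		if err == nil {
			b.Reset() // success: start back at Min the next time retries are needed
			return
		}
		d := b.Duration() // next sleep, growing exponentially with jitter
		fmt.Printf("attempt #%d failed: %v; retrying in %s\n", attempt, err, d)
		time.Sleep(d)
	}
	fmt.Println("giving up after", maxAttempts, "attempts")
}
```

The Reset() call is what keeps a long-running caller from sitting permanently at the max interval once the upstream service has recovered.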
