+1, seems reasonable.

On Wed, Oct 24, 2018 at 15:03 Rawlin Peters <[email protected]> wrote:
> Hey Traffic Controllers,
>
> As a large distributed system, Traffic Control has a lot of components
> that make requests to each other over the network, and these requests
> are known to fail sometimes and require retrying periodically. Some of
> these requests can also be very expensive to make and should require a
> more careful retry strategy.
>
> For that reason, I'd like to propose that we make it easier for our Go
> components to do exponential backoff retrying with randomness by
> introducing a simple Go library that would require minimal changes to
> existing retry logic and would be simple to implement in new code that
> requires retrying.
>
> I've noticed a handful of places in different Traffic Control
> components where there is retrying of operations that may potentially
> fail (e.g. making requests to the TO API). Most of these areas simply
> retry the failed operation after sleeping for X seconds, where X is
> hard-coded or configurable with a sane default. In cases where an
> operation fails due to being overloaded (or failing to come back up
> due to requests piling up), I think we could make TC more robust by
> introducing exponential backoff algorithms to our retry handlers.
>
> In cases where a service is overloaded and taking a long time to
> respond to requests because of it, simply retrying the same request
> every X seconds is not ideal and could prevent the overloaded service
> from becoming healthy again. It is generally a best practice to
> exponentially back off on retrying requests, and it is even better
> with added randomness that prevents multiple retries occurring at the
> same time from different clients.
>
> For example, this is what a lot of retries look like today:
> attempt #1 failed
> * sleep 1s *
> attempt #2 failed
> * sleep 1s *
> attempt #3 failed
> * sleep 1s *
> ... and so on
>
> This is what it would look like with exponential (factor of 2) backoff
> with some randomness/jitter:
> attempt #1 failed
> * sleep 0.8s *
> attempt #2 failed
> * sleep 2.1s *
> attempt #3 failed
> * sleep 4.2s *
> ... and so on until some max interval like 60s ...
> attempt #9 failed
> * sleep 61.3s *
> attempt #10 failed
> * sleep 53.1s *
> ... and so on
>
> I think this would greatly improve our project's story around
> retrying, and I've found a couple Go libraries that seem pretty good.
> Both are MIT-licensed:
>
> https://github.com/cenkalti/backoff
> - provides the backoff algorithm but also a handful of other useful
> features like Retry wrappers which handle looping for you
> https://github.com/jpillora/backoff
> - a bit simpler than the other and provides just the basic backoff
> calculator/counter which provides a Duration that you'd use in
> `time.Sleep()`, leaving you responsible for actually looping over the
> retriable operation
>
> I think I'd prefer jpillora/backoff for its simplicity and
> conciseness. You basically just build a struct with your backoff
> options (min, max, factor, jitter) then it spits out Durations for you
> to sleep for before retrying. It doesn't expect much else from you.
>
> Please drop a +1/-1 if you're for or against adding exponential
> backoff retrying and vendoring a project like one of the above in
> Traffic Control.
>
> Thanks,
> Rawlin
