+1, seems reasonable.

On Wed, Oct 24, 2018 at 15:03 Rawlin Peters <[email protected]> wrote:

> Hey Traffic Controllers,
>
> As a large distributed system, Traffic Control has a lot of components
> that make requests to each other over the network, and these requests
> sometimes fail and need to be retried. Some of these requests are also
> expensive to make and call for a more careful retry strategy.
>
> For that reason, I'd like to propose that we make it easier for our Go
> components to do exponential backoff retrying with randomness by
> introducing a simple Go library that would require minimal changes to
> existing retry logic and would be simple to implement in new code that
> requires retrying.
>
> I've noticed a handful of places in different Traffic Control
> components where there is retrying of operations that may potentially
> fail (e.g. making requests to the TO API). Most of these areas simply
> retry the failed operation after sleeping for X seconds, where X is
> hard-coded or configurable with a sane default. In cases where an
> operation fails because the target service is overloaded (or cannot
> come back up because requests keep piling up), I think we could make
> TC more robust by introducing exponential backoff to our retry
> handlers.
>
> In cases where a service is overloaded and taking a long time to
> respond to requests because of it, simply retrying the same request
> every X seconds is not ideal and could prevent the overloaded service
> from becoming healthy again. It is generally a best practice to
> exponentially back off on retrying requests, and it is even better
> with added randomness that prevents multiple retries occurring at the
> same time from different clients.
>
> For example, this is what a lot of retries look like today:
> attempt #1 failed
> * sleep 1s *
> attempt #2 failed
> * sleep 1s *
> attempt #3 failed
> * sleep 1s *
> ... and so on
>
> This is what it would look like with exponential (factor of 2) backoff
> with some randomness/jitter:
> attempt #1 failed
> * sleep 0.8s *
> attempt #2 failed
> * sleep 2.1s *
> attempt #3 failed
> * sleep 4.2s *
> ... and so on until some max interval like 60s ...
> attempt #9 failed
> * sleep 61.3s *
> attempt #10 failed
> * sleep 53.1s *
> ... and so on
>
> I think this would greatly improve our project's story around
> retrying, and I've found a couple of Go libraries that seem pretty good.
> Both are MIT-licensed:
>
> https://github.com/cenkalti/backoff
> - provides the backoff algorithm but also a handful of other useful
> features like Retry wrappers which handle looping for you
> https://github.com/jpillora/backoff
> - a bit simpler than the other and provides just the basic backoff
> calculator/counter which provides a Duration that you'd use in
> `time.Sleep()`, leaving you responsible for actually looping over the
> retriable operation
>
> I think I'd prefer jpillora/backoff for its simplicity and
> conciseness. You basically just build a struct with your backoff
> options (min, max, factor, jitter) then it spits out Durations for you
> to sleep for before retrying. It doesn't expect much else from you.
>
> Please drop a +1/-1 if you're for or against adding exponential
> backoff retrying and vendoring a project like one of the above in
> Traffic Control.
>
> Thanks,
> Rawlin
>
