Hello,

For some background, we've been running a fork with this patch in production and it has been stable so far (no issues). I've been testing it in our dev/staging environments by using iptables to "kill" machines, and this change does appear to bring down the latency hit for requests that get sent to these dead backends. For example, I configured a backend service with 5 retries and a 100ms connection timeout and executed the following test:
Test command:

    ab -c 100 -n 100000 localhost:<frontend_port>

Command to simulate failure, which I would run about halfway through the ab run:

    sudo iptables -A OUTPUT -p tcp --dport <backend_port> -j DROP

Control (no failure):

Percentage of the requests served within a certain time (ms)
  50%     10
  66%     10
  75%     11
  80%     11
  90%     12
  95%     14
  98%     15
  99%     17
 100%     55 (longest request)

Typical retry/redispatch setup + backend failure (5 retries):

Percentage of the requests served within a certain time (ms)
  50%      8
  66%      8
  75%      8
  80%      9
  90%     11
  95%     16
  98%     20
  99%     29
 100%    550 (longest request)

allredisp setup + backend failure (5 retries):

Percentage of the requests served within a certain time (ms)
  50%      9
  66%     10
  75%     11
  80%     11
  90%     13
  95%     15
  98%     24
  99%    107
 100%    151 (longest request)

These results lined up with what we have witnessed in production when backends have failed, namely that the services with the option enabled saw much better worst-case latencies. The 100ms creeping into the 99th percentile most likely comes from the window between when I "failed" the machines and when HAProxy marked the backend down due to failed health checks, but I can certainly try to refine my experimental setup. Note that the above benchmarks are runs that didn't result in ab exiting due to failed requests against the backend that dies; we can't prevent that with connection retries alone.

Thoughts?

-Joey

On Thu, Mar 26, 2015 at 2:58 PM, Joseph Lynch <[email protected]> wrote:
> I have been experimenting with increasing resilience to backend server
> failure in a low latency environment, and one thing that keeps coming up
> is that connection retries are made against the same server. Typically in
> our datacenters when we have connection issues it is because a machine or
> rack has gone dark (crashed, partitioned, etc ...). As far as I can
> understand, retries against these dead backends are just wasted time
> until the final redispatch.
>
> Our current workaround for this is to do a single retry and enable option
> redispatch (so that the single retry is a redispatch), but since network
> failures tend to be correlated, we have seen issues when multiple
> machines get partitioned and both connection attempts then fail. Having
> redispatches on every retry might solve this to a large extent.
>
> It turns out that the patch to do so is pretty simple, so I tried it out,
> and in my preliminary tests it seems promising. As such, I wanted to ask
> this group whether it seems like a good idea, and if so, what you think
> of my solution. I understand that there may be very good reasons to retry
> against the same backend instance instead of redispatching, which is why
> I implemented it as an option.
>
> Thank you,
> -Joey Lynch
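P.S. For anyone who wants to reproduce the benchmark, the configuration I used looked roughly like the sketch below. The ports and server addresses are placeholders, `option allredisp` is the keyword added by the patch, and everything else is stock HAProxy configuration:

```
defaults
    mode http
    retries 5
    timeout connect 100ms
    timeout client  5s
    timeout server  5s

frontend test_fe
    bind *:<frontend_port>
    default_backend test_be

backend test_be
    # With the patch: redispatch to a different server on every
    # retry, instead of only on the final attempt.
    option allredisp
    server app1 10.0.0.1:<backend_port> check
    server app2 10.0.0.2:<backend_port> check
```

For the "typical retry/redispatch" runs I swapped `option allredisp` for the standard `option redispatch`.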

