Hello,

For some background, we've been running a fork with this patch in production and it has been stable so far (no issues). I've been testing it in our dev/staging environments by using iptables to "kill" machines, and this change does appear to bring down the latency hit for requests that get sent to these dead backends. For example, I configured a backend service with 5 retries and a 100ms connection timeout and executed the following test:
Test command:

    ab -c 100 -n 100000 localhost:<frontend_port>

Command to simulate failure, which I would run about halfway through the ab run:

    sudo iptables -A OUTPUT -p tcp --dport <backend_port> -j DROP

Control (no failure):

Percentage of the requests served within a certain time (ms)
  50%     10
  66%     10
  75%     11
  80%     11
  90%     12
  95%     14
  98%     15
  99%     17
 100%     55 (longest request)

Typical retry/redispatch setup + backend failure (5 retries):

Percentage of the requests served within a certain time (ms)
  50%      8
  66%      8
  75%      8
  80%      9
  90%     11
  95%     16
  98%     20
  99%     29
 100%    550 (longest request)

allredisp setup + backend failure (5 retries):

Percentage of the requests served within a certain time (ms)
  50%      9
  66%     10
  75%     11
  80%     11
  90%     13
  95%     15
  98%     24
  99%    107
 100%    151 (longest request)

These results lined up with what we have witnessed in production when backends have failed, namely that the services with the option enabled saw much better worst-case latencies. The 100ms creeping into the 99th percentile most likely comes from the window between when I "failed" the machines and when HAProxy marked the backend down due to failed health checks, but I can certainly try to refine my experimental setup. Note that the above benchmarks are runs that didn't result in ab exiting due to failed requests against the backend that dies; we can't prevent that with connection retries alone.

Thoughts?

-Joey

On Thu, Mar 26, 2015 at 2:58 PM, Joseph Lynch <[email protected]> wrote:
> I have been experimenting with increasing resilience to backend server
> failure in a low latency environment, and one thing that keeps coming up
> is that connection retries are made against the same server. Typically in
> our datacenters when we have connection issues it is because a machine or
> rack has gone dark (crashed, partitioned, etc ...). As far as I can
> understand, retries against these dead backends are just wasted time
> until the final redispatch.
>
> Our current workaround for this is to do a single retry and enable option
> redispatch (so that the single retry is a redispatch), but since network
> failures tend to be correlated, we have seen issues when multiple
> machines get partitioned and both connection attempts then fail. Having
> redispatches on every retry might solve this to a large extent.
>
> It turns out that the patch to do so is pretty simple, so I tried it out,
> and in my preliminary tests it seems promising. As such, I wanted to ask
> this group whether it seems like a good idea, and if so, what you think
> of my solution. I understand that there may be very good reasons to retry
> against the same backend instance instead of redispatching, which is why
> I implemented it as an option.
>
> Thank you,
> -Joey Lynch
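P.S. For anyone who wants to reproduce the benchmark, the configuration I used looked roughly like the sketch below. The ports and server addresses are placeholders, `option allredisp` is the keyword added by the patch, and everything else is stock HAProxy configuration:

```
defaults
    mode http
    retries 5
    timeout connect 100ms
    timeout client  5s
    timeout server  5s

frontend test_fe
    bind *:<frontend_port>
    default_backend test_be

backend test_be
    # With the patch: redispatch to a different server on every
    # retry, instead of only on the final attempt.
    option allredisp
    server app1 10.0.0.1:<backend_port> check
    server app2 10.0.0.2:<backend_port> check
```

For the "typical retry/redispatch" runs I swapped `option allredisp` for the standard `option redispatch`.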

