I have been experimenting with increasing resilience to backend server failure in a low-latency environment, and one thing that keeps coming up is that connection retries are made against the same server. In our datacenters, connection issues are typically caused by a machine or rack going dark (crashed, partitioned, etc.). As far as I can tell, retries against these dead backends are just wasted time until the final redispatch.
Our current workaround is to do a single retry and enable option redispatch, so that the single retry is a redispatch. However, network failures tend to be correlated, and we have seen issues when multiple machines get partitioned and both connection attempts fail. Redispatching on every retry might solve this to a large extent.
It turns out that the patch to do so is pretty simple, so I tried it out, and in my preliminary tests it seems promising. I wanted to ask this group whether it seems like a good idea and, if so, what you think of my solution. I understand that there may be very good reasons to retry against the same backend instance instead of redispatching, which is why I implemented it as an option.
Thank you,
-Joey Lynch
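P.S. For reference, the workaround described above corresponds roughly to the configuration below. This is a minimal sketch rather than our production config: the backend name, server addresses, and timeout values are placeholders chosen for illustration.

```
# Sketch of the current workaround. Names, addresses, and timeouts
# are placeholders, not values from a real deployment.
defaults
    mode http
    timeout connect 100ms   # illustrative value for a low-latency setup
    timeout client  5s
    timeout server  5s
    retries 1               # allow a single connection retry
    option redispatch       # let that retry go to a different server

backend app
    balance roundrobin
    server app1 192.0.2.10:8080 check
    server app2 192.0.2.11:8080 check
```

With only one retry allowed, that retry is also the final one, so it is the attempt that gets redispatched. This still leaves just two connection attempts in total, which is why correlated failures across two machines can exhaust them.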
0001-Allow-redispatch-on-every-request.patch

