I have been experimenting with increasing resilience to backend server
failure in a low latency environment and one thing that keeps coming up is
that connection retries are made against the same server. Typically in our
datacenters when we have connection issues it is because a machine or rack
has gone dark (crashed, partitioned, etc.). As far as I can understand,
retries against these dead backends are just wasted time until the final
redispatch.

Our current workaround is to allow a single retry ("retries 1") and enable
"option redispatch" (so that the single retry is a redispatch), but since network
failures tend to be correlated we have seen issues when multiple machines
get partitioned and then both connection attempts fail. Having redispatches
on every retry might solve this to a large extent.
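
For reference, the workaround described above looks roughly like the sketch
below. The backend name, server addresses and timeout values are placeholders
chosen for illustration, not our production settings; only "retries" and
"option redispatch" are the point here.

```
# Sketch of the current workaround (placeholder names, addresses and timeouts).
defaults
    mode http
    timeout connect 100ms
    timeout client  1s
    timeout server  1s
    # Allow exactly one retry after the initial connection failure.
    retries 1
    # Let the last retry be redispatched to a different server.
    option redispatch

backend app_servers
    balance roundrobin
    server app1 192.0.2.11:8080 check
    server app2 192.0.2.12:8080 check
```

With this setup a request gets at most two connection attempts: the original
one and a single redispatched retry. If both servers picked happen to sit
behind the same partition, the request fails.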

It turns out that the patch to do so is pretty simple, so I tried it out
and in my preliminary tests it seems promising. As such, I wanted to ask
this group whether it seems like a good idea and, if so, what you think
of my solution. I understand that there may be very good reasons to
retry against a backend instance instead of redispatching, which is why I
implemented it as an option.

Thank you,
-Joey Lynch

Attachment: 0001-Allow-redispatch-on-every-request.patch