I don't mean to bump unnecessarily, but is there any feedback on this
approach? I can re-submit as a patch with documentation and diffed
against the latest dev branch if the feedback is "let's pursue this".

-Joey

On Wed, Apr 1, 2015 at 7:13 PM, Joseph Lynch <[email protected]> wrote:
> Hello,
>
> For some background, we've been running a fork with this patch in production
> and it has been stable so far (no issues). I've been testing it in our
> dev/staging environments by using iptables to "kill" machines, and this
> change does appear to reduce the latency hit for requests that get sent to
> these dead backends. For example, I configured a backend service with 5
> retries and 100ms connection timeouts and executed the following
> test:
>
> Test command:
> ab -c 100 -n 100000 localhost:<frontend_port>
>
> Command to simulate failure, which I would run about halfway through the
> ab run:
> sudo iptables -A OUTPUT -p tcp --dport <backend_port> -j DROP
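
A minimal sketch of an HAProxy configuration matching the test described above (5 retries, 100ms connect timeout). The frontend/backend names, ports, and addresses are placeholders, not from the original message:

```
# Sketch only: names, ports, and addresses are hypothetical.
defaults
    mode http
    timeout connect 100ms   # the 100ms connection timeout from the test
    timeout client  1s
    timeout server  1s
    retries 5               # the 5 retries from the test
    option redispatch

frontend test_fe
    bind *:8080             # the <frontend_port> that ab targets
    default_backend test_be

backend test_be
    server srv1 127.0.0.1:9001 check   # the <backend_port> that iptables drops
    server srv2 127.0.0.1:9002 check
```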
>
> Control (no failure):
>   Percentage of the requests served within a certain time (ms)
>   50%     10
>   66%     10
>   75%     11
>   80%     11
>   90%     12
>   95%     14
>   98%     15
>   99%     17
>  100%     55 (longest request)
>
> Typical retry/redispatch setup + backend failure (5 retries):
>   Percentage of the requests served within a certain time (ms)
>   50%      8
>   66%      8
>   75%      8
>   80%      9
>   90%     11
>   95%     16
>   98%     20
>   99%     29
>  100%    550 (longest request)
>
> allredisp setup + backend failure (5 retries):
>   Percentage of the requests served within a certain time (ms)
>   50%      9
>   66%     10
>   75%     11
>   80%     11
>   90%     13
>   95%     15
>   98%     24
>   99%    107
>  100%    151 (longest request)
>
> These results line up with what we have witnessed when backends have failed
> in production, namely that the services with the option enabled saw much
> better worst-case latencies. The 100ms creeping into the 99th percentile
> most likely comes down to the timing between when I "failed" the machines
> and when HAProxy marked the backend down due to failed health checks, but I
> can certainly try to refine my experimental setup. Note that the benchmarks
> above are from runs in which ab did not exit early due to failed requests
> against the dying backend; connection retries alone cannot prevent such
> failures.
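
A back-of-the-envelope check of the tail latencies above: with retries pinned to the same dead server, every attempt burns a full connect timeout, whereas redispatching each retry should cost roughly one timeout before a healthy server is reached. This is a simplification (it ignores retry backoff and the time for the final successful request), but the numbers line up with the observed 550ms and ~107-151ms tails:

```python
# Simplified worst-case latency model for a request whose first attempt
# hits a dead backend, using the test parameters above.
CONNECT_TIMEOUT_MS = 100
RETRIES = 5

# Retrying against the same dead server: all retries time out until the
# final redispatch.
worst_case_same_server_ms = RETRIES * CONNECT_TIMEOUT_MS

# Redispatching on every retry: only the first attempt (to the dead
# server) times out; the next attempt goes to a healthy server.
worst_case_redispatch_ms = 1 * CONNECT_TIMEOUT_MS

print(worst_case_same_server_ms)  # 500 -- consistent with the 550ms longest request
print(worst_case_redispatch_ms)   # 100 -- consistent with the ~107-151ms tail
```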
>
> Thoughts?
> -Joey
>
> On Thu, Mar 26, 2015 at 2:58 PM, Joseph Lynch <[email protected]> wrote:
>>
>> I have been experimenting with increasing resilience to backend server
>> failure in a low-latency environment, and one thing that keeps coming up is
>> that connection retries are made against the same server. Typically, when we
>> have connection issues in our datacenters, it is because a machine or rack
>> has gone dark (crashed, partitioned, etc.). As far as I can tell, retries
>> against these dead backends are just wasted time until the final
>> redispatch.
>>
>> Our current workaround is to do a single retry and enable option
>> redispatch (so that the single retry is a redispatch), but since network
>> failures tend to be correlated, we have seen issues when multiple machines
>> get partitioned and both connection attempts fail. Redispatching on every
>> retry might solve this to a large extent.
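
For concreteness, a sketch of the workaround described above; the backend and server names are placeholders:

```
# Sketch of the current workaround: one retry which, with option
# redispatch, becomes a redispatch to a different server.
backend app_be
    retries 1
    option redispatch
    server srv1 10.0.0.1:8000 check
    server srv2 10.0.0.2:8000 check
```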
>>
>> It turns out that the patch to do so is pretty simple, so I tried it out,
>> and in my preliminary tests it seems promising. As such, I wanted to ask
>> this group: does this seem like a good idea? If so, what do you think of my
>> solution? I understand that there may be very good reasons to retry
>> against a particular backend instance instead of redispatching, which is why
>> I implemented it as an option.
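
A hypothetical usage sketch of the proposed option (named "allredisp" in the benchmarks above); the exact keyword and syntax would depend on the final patch, and the backend/server names are placeholders:

```
# Hypothetical: "option allredisp" is the proposed knob, not an
# existing HAProxy keyword.
backend app_be
    retries 5
    option allredisp    # redispatch on every retry, not just the last
    server srv1 10.0.0.1:8000 check
    server srv2 10.0.0.2:8000 check
```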
>>
>> Thank you,
>> -Joey Lynch
>
>
