------------------------------------------------------------------------
*From: *Sok Ann Yap <[email protected]>
*Sent: * 2014-02-21 05:11:48 E
*To: *[email protected]
*Subject: *Re: Just a simple thought on health checks after a soft
reload of HAProxy....
> Patrick Hemmer <haproxy@...> writes:
>
>> From: Willy Tarreau <w <at> 1wt.eu>
>>
>> Sent: 2014-01-25 05:45:11 E
>>
>> Till now that's exactly what's currently done. The servers are marked
>> "almost dead", so the first check gives the verdict. Initially we had
>> all checks started immediately. But it caused a lot of issues at several
>> places where there were a high number of backends or servers mapped to
>> the same hardware, because the rush of connections really caused the
>> servers to be flagged as down. So we started to spread the checks over
>> the longest check period in a farm.
>>
>> Is there a way to enable this behavior? In my
>> environment/configuration, it causes absolutely no issue that all
>> the checks be fired off at the same time.
>> As it is right now, when haproxy starts up, it takes it quite a
>> while to discover which servers are down.
>> -Patrick
>>
> I faced the same problem in
> http://thread.gmane.org/gmane.comp.web.haproxy/14644
>
> After much contemplation, I decided to just patch away the initial spread
> check behavior:
> https://github.com/sayap/sayap-overlay/blob/master/net-proxy/haproxy/files/haproxy-immediate-first-check.diff
>
>
I definitely think there should be an option to disable the behavior. We
have an automated system which adds and removes servers from the config,
and then bounces haproxy. Every time haproxy is bounced, we have a
period where it can send traffic to a dead server.

There's also a related bug here.
With a config that has "inter 30s fastinter 1s" and no httpchk enabled,
haproxy spreads the initial checks over the fastinter period when it first
starts up, but the stats output says "UP 1/3" for the full 30 seconds. It
also says "L4OK in 30001ms", even though I know it doesn't take the server
30 seconds to simply accept a connection.
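
A minimal sketch of the two setups being compared, for anyone who wants to
reproduce this. The backend names, server name, address, and timeouts are
placeholders and not taken from the real config; only "inter 30s fastinter 1s"
and "option httpchk" come from the description here.

```
defaults
    mode http
    timeout connect 5s
    timeout client  30s
    timeout server  30s

# Plain TCP (layer 4) check: the case that shows "UP 1/3" and
# "L4OK in 30001ms" after startup.
backend app_l4_check            # hypothetical name
    server app1 192.0.2.10:80 check inter 30s fastinter 1s

# Same server line with an HTTP check added, for comparison.
backend app_l7_check            # hypothetical name
    option httpchk
    server app1 192.0.2.10:80 check inter 30s fastinter 1s
```

The only difference between the two backends is the "option httpchk" line,
so any difference in the stats output should come from the check type.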
Yet you get different behavior when using httpchk. When I add "option
httpchk", it still spreads the checks over the 1s fastinter value, but
the stats output goes full "UP" immediately after the check occurs, not
"UP 1/3". It also says "L7OK/200 in 0ms", which is what I expect to see.
-Patrick