This is also an issue for us (see my post from a few days ago) - on HAProxy's first start, most hosts are marked DOWN with a Layer4 timeout, even though they are fine, because there are a large number of them.
Some workaround or more forgiving initial health check would be useful here.

----
Kevin Burke | 415-723-4116 | www.twilio.com

On Tue, Jan 28, 2014 at 8:13 AM, Patrick Hemmer <[email protected]> wrote:

> *From: *Willy Tarreau <[email protected]>
> *Sent: * 2014-01-25 05:45:11 E
> *To: *Patrick Hemmer <[email protected]>
> *CC: *Malcolm Turnbull <[email protected]>, [email protected]
> *Subject: *Re: Just a simple thought on health checks after a soft reload
> of HAProxy....
>
> On Tue, Jan 21, 2014 at 09:04:12PM -0500, Patrick Hemmer wrote:
>
> Personally I would not like that every server is considered down until
> after the health checks pass. Basically this would result in things
> being down after a reload, which defeats the point of the reload being
> non-interruptive.
>
> I can confirm, we had this in a very early version, something like 1.0.x
> and it was quickly changed! I've been using Alteon load balancers for
> years and their health checks are slow. I remember that the persons in
> charge for them were always scared to reboot them because the services
> remained down for a long time after a reboot (seconds to minutes). So
> we definitely don't want this to happen here.
>
> I can think of 2 possible solutions:
> 1) When the new process comes up, do an initial check on all servers
> (just one) which have checks enabled. Use that one check as the verdict
> for whether each server should be marked 'up' or 'down'.
>
> Till now that's exactly what's currently done. The servers are marked
> "almost dead", so the first check gives the verdict. Initially we had
> all checks started immediately. But it caused a lot of issues at several
> places where there were a high number of backends or servers mapped to
> the same hardware, because the rush of connection really caused the
> servers to be flagged as down. So we started to spread the checks over
> the longest check period in a farm.
>
> Is there a way to enable this behavior? In my environment/configuration,
> it causes absolutely no issue that all the checks be fired off at the same
> time.
> As it is right now, when haproxy starts up, it takes it quite a while to
> discover which servers are down.
>
> -Patrick
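
To make the trade-off in the quoted thread concrete, here is a rough sketch of what "spread the checks over the longest check period" could look like. This is only an illustrative model in Python, not HAProxy's actual scheduler; the function name `initial_check_offsets` and its parameters are hypothetical, and even spacing is an assumption on my part.

```python
def initial_check_offsets(num_servers, longest_interval_ms, spread=True):
    """Return the delay (in ms) before each server's first health check.

    Hypothetical model only: with spread=True the first checks are spaced
    evenly across the longest check interval; with spread=False they all
    fire at time 0, which is the behavior Patrick asks about.
    """
    if num_servers <= 0:
        return []
    if not spread:
        return [0] * num_servers
    step = longest_interval_ms / num_servers
    return [int(i * step) for i in range(num_servers)]


# Four servers, longest check interval of 2000 ms.
print(initial_check_offsets(4, 2000))                # [0, 500, 1000, 1500]
print(initial_check_offsets(4, 2000, spread=False))  # [0, 0, 0, 0]
```

With spreading, the last server's verdict arrives most of one interval after startup, which is the slow discovery Patrick describes. Without it, every verdict arrives at once, at the cost of the connection rush Willy mentions.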

