Re: Current solutions to the soft-restart-healthcheck-spread problem?

Willy Tarreau Thu, 06 Mar 2014 23:02:23 -0800

Hi Jonathan,

On Thu, Mar 06, 2014 at 03:15:25PM +0000, Jonathan Matthews wrote:
> Hi all -
> 
> [ tl;dr How do you stop haproxy using failed backend servers immediately
> after reload?
> Haproxy devs, please consider implementing a
> consider-servers-initially-DOWN option! ]
> 
> I wonder if people could outline how they're dealing with the combination
> of these two haproxy behaviours:
> 
> 1) On restart/reload/disabled-server-now-enabled-via-admin-interface,
> haproxy considers a server to be 1 health check away from going down, but
> considers it *initially* up.
> 
> 2) On restart/reload, haproxy spreads out each backend's(?) initial server
> health checks over the entire health check interval.
> 
> (If I'm slightly off with either of those statements, please forgive the
> inaccuracy and let it slide for the purposes of this discussion; do let me
> know if I'm /meaningfully/ wrong of course!)
> 
> The combination of these facts in a high traffic environment seems to imply
> that an unhealthy-but-just-enabled server which is listed last in an
> haproxy backend may receive requests for a longer-than-expected period of
> time, resulting in a non-trivial number of requests failing.
> 
> In such an environment, where multiple load balancers are involved and can
> be reloaded sequentially (such as mine!), it would be preferable to take a
> pessimistic approach and /not/ expose servers to traffic until you're
> positive that the backend is healthy, rather than haproxy's current
> default-optimism approach.
> 
> I've been considering some methods to deal with this, but haven't got a
> working config yet. It's getting somewhat convoluted and stick-table heavy,
> so I thought I'd ask everyone:
> 
> Where you have decided that this is something you actually need to deal
> with, *how* are you doing that? (I totally recognise that the combination
> of a frequent health check interval and non-insane traffic volumes may mask
> this issue, leading many -- myself included in previous jobs! -- not to
> consider it a problem in the first place)
> 
> It's worth pointing out that I /believe/ this situation could be easily
> solved (operationally) by a global, per-backend or per-server option which
> switches on the pessimistic behaviour mentioned above. I recognise that
> this may not be easy from an /implementation/ perspective, of course.
> [Willy: any chance of an option to start each server as if it were down,
> but being 1 check away from going up, rather than the opposite? :-)]


I'm adding this to the todo list. In fact, this mode was chosen more than
10 years ago after having been forced to live with equipment doing the
exact opposite (what you're asking for): you start the equipment, it
receives traffic and drops everything because no server is up yet. With
the current behaviour, even if you have one dead server in the farm, the
traffic is properly distributed to valid servers and the dead server
causes a redispatch after a few retries.

But I agree, we need to have options to start up by default or down by
default.
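
As a rough illustration of the two start modes discussed here, the sketch
below models a server that starts either "up, one failed check away from
down" or "down, one successful check away from up". It assumes a simplified
rise/fall streak counter; the class name, the start_up flag and the default
rise/fall values are hypothetical and this is not haproxy's actual code.

```python
class ServerHealth:
    """Toy rise/fall model (an assumption, not haproxy's implementation)."""

    def __init__(self, rise=2, fall=3, start_up=True):
        self.rise = rise    # consecutive successes needed to go UP
        self.fall = fall    # consecutive failures needed to go DOWN
        self.up = start_up  # hypothetical "start up / start down" option
        # Pre-load the streak so that a single contrary check flips the state.
        self.streak = (fall - 1) if start_up else (rise - 1)

    def check(self, ok):
        """Record one health check result and return the new state."""
        needed = self.fall if self.up else self.rise
        if ok == self.up:
            # The result agrees with the current state: forget the streak.
            self.streak = 0
        else:
            self.streak += 1
            if self.streak >= needed:
                self.up = not self.up
                self.streak = 0
        return self.up


optimistic = ServerHealth(start_up=True)    # current behaviour
pessimistic = ServerHealth(start_up=False)  # behaviour requested above
print(optimistic.up, optimistic.check(False))
print(pessimistic.up, pessimistic.check(True))
```

In this model the optimistic server takes traffic until its first check
fails, while the pessimistic one takes none until its first check succeeds,
which is exactly the trade-off described above.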

Concerning the start of health checks, I'm now thinking that we could
have a global parameter indicating the maximum distance between the
first and the last health check. It would probably satisfy all users.
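
To make the idea of capping the spread concrete, here is a minimal sketch.
It assumes the first checks are spaced evenly across the check interval,
which is a simplification of the behaviour described in the quoted message;
the function name and the max_spread_ms parameter are hypothetical.

```python
def first_check_offsets(n_servers, inter_ms, max_spread_ms=None):
    """Return the delay in ms before each server's first health check.

    Assumption: checks are spaced evenly over a window that is the check
    interval, optionally capped by a hypothetical global maximum spread.
    """
    if n_servers <= 0:
        return []
    window = inter_ms if max_spread_ms is None else min(inter_ms, max_spread_ms)
    return [i * window // n_servers for i in range(n_servers)]


# Four servers, 10 s interval: the last server waits 7.5 s for its first check.
print(first_check_offsets(4, 10000))
# Same farm with the spread capped at 2 s: the last server waits 1.5 s.
print(first_check_offsets(4, 10000, max_spread_ms=2000))
```

The cap bounds how long a just-enabled but unhealthy server listed last can
keep receiving traffic before its first check runs.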

> It's also worth pointing out that, whilst the "persist haproxy state over
> soft restarts" concept that's been mentioned previously on list would solve
> this for orderly restarts, it wouldn't solve it for crashes, reboots or
> otherwise. I think the option I mentioned above would be one way to solve
> it nicely, for multiple use cases.

Yes, but nobody has had the time to work on it yet. I just wanted to have
the ability to send a state dump to a file (e.g. "show servers" on the CLI)
and feed this format to the input of the new process. It would be very
simple to do and very efficient.
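
A minimal sketch of that dump-and-reload idea follows. The one-line-per-server
text format ("backend server UP|DOWN") and both function names are invented
for illustration; no such format is defined in this thread.

```python
def dump_state(states):
    """Serialise {(backend, server): is_up} to a hypothetical text dump."""
    return "".join(
        f"{backend} {server} {'UP' if up else 'DOWN'}\n"
        for (backend, server), up in sorted(states.items())
    )


def load_state(text, configured, default_up=True):
    """Seed the new process's servers from a dump.

    Servers missing from the dump (newly added, or no dump after a crash)
    fall back to default_up, so this does not cover the crash/reboot case.
    """
    saved = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 3 or parts[2] not in ("UP", "DOWN"):
            continue  # ignore malformed lines
        saved[(parts[0], parts[1])] = parts[2] == "UP"
    return {key: saved.get(key, default_up) for key in configured}


old = {("web", "s1"): True, ("web", "s2"): False}
dump = dump_state(old)
new = load_state(dump, [("web", "s1"), ("web", "s2"), ("web", "s3")])
print(new)
```

Here s2 stays down across the reload, while s3, which was not in the dump,
starts with the default state.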

> [ For a *not* nice solution, I'll post a follow up when I get my
> stick-table concept going. It's /nasty/. IMHO. Don't make me put it into
> production! ;-) ]

OK

cheers,
Willy
