Current solutions to the soft-restart-healthcheck-spread problem?

Jonathan Matthews Thu, 06 Mar 2014 07:16:37 -0800

Hi all -

[ tl;dr: How do you stop haproxy from using failed backend servers
immediately after a reload? Haproxy devs, please consider implementing a
consider-servers-initially-DOWN option! ]


I wonder if people could outline how they're dealing with the combination
of these two haproxy behaviours:

1) On restart/reload/disabled-server-now-enabled-via-admin-interface,
haproxy considers a server to be 1 health check away from going down, but
considers it *initially* up.

2) On restart/reload, haproxy spreads out each backend's(?) initial server
health checks over the entire health check interval.

(If I'm slightly off with either of those statements, please forgive the
inaccuracy and let it slide for the purposes of this discussion; do let me
know if I'm /meaningfully/ wrong of course!)

The combination of these facts in a high traffic environment seems to imply
that an unhealthy-but-just-enabled server which is listed last in an
haproxy backend may receive requests for a longer-than-expected period of
time, resulting in a non-trivial number of requests failing.
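
To put rough numbers on that exposure, here is a back-of-the-envelope
sketch in Python. It is not taken from haproxy itself. It assumes the
first checks are spread evenly across the check interval in server order,
that traffic is balanced evenly across all servers, and that every request
sent to the unhealthy server fails. The function names and the example
figures are made up for illustration.

```python
def first_check_offset(index, n_servers, interval_s):
    """Assumed delay (seconds) before server `index` (0-based) gets its
    first health check, if checks are spread evenly over the interval."""
    return interval_s * index / n_servers


def failed_requests_before_first_check(index, n_servers, interval_s, total_rps):
    """Rough count of requests sent to an unhealthy server that is treated
    as up until its first check runs (even load balancing assumed)."""
    per_server_rps = total_rps / n_servers
    return per_server_rps * first_check_offset(index, n_servers, interval_s)


if __name__ == "__main__":
    # Hypothetical figures: 10 servers, 10 s check interval, 5000 req/s total.
    for idx in (0, 9):
        lost = failed_requests_before_first_check(idx, 10, 10, 5000)
        print(f"server {idx}: ~{lost:.0f} requests before first check")
```

Under those assumptions the last server listed waits the longest for its
first check, so it is the worst place for an unhealthy server to be.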

In such an environment, where multiple load balancers are involved and can
be reloaded sequentially (such as mine!), it would be preferable to take a
pessimistic approach and /not/ expose servers to traffic until you're
positive that the backend is healthy, rather than haproxy's current
default-optimism approach.

I've been considering some methods to deal with this, but haven't got a
working config yet. It's getting somewhat convoluted and stick-table heavy,
so I thought I'd ask everyone:

Where you have decided that this is something you actually need to deal
with, *how* are you doing that? (I totally recognise that the combination
of a frequent health check interval and non-insane traffic volumes may mask
this issue, leading many -- myself included in previous jobs! -- not to
consider it a problem in the first place)

It's worth pointing out that I /believe/ this situation could be easily
solved (operationally) by a global, per-backend or per-server option which
switches on the pessimistic behaviour mentioned above. I recognise that
this may not be easy from an /implementation/ perspective, of course.
[Willy: any chance of an option to start each server as if it were down,
but being 1 check away from going up, rather than the opposite? :-)]
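
The difference between the two starting states can be modelled with a
small sketch. This is a toy model in Python, not haproxy's code: the class
and its `start_up` flag are hypothetical, and `rise` and `fall` only borrow
the names of haproxy's per-server check settings, with values picked for
illustration.

```python
class ServerHealth:
    """Toy rise/fall health tracker (hypothetical, not haproxy's code)."""

    def __init__(self, rise=2, fall=3, start_up=True):
        self.rise = rise
        self.fall = fall
        self.up = start_up
        # Start one check away from flipping state:
        #   optimistic  (start_up=True):  one failure marks the server down
        #   pessimistic (start_up=False): one success marks the server up
        self.streak = (fall - 1) if start_up else (rise - 1)

    def record_check(self, ok):
        """Feed one check result and return whether the server is up."""
        if self.up:
            # While up, streak counts consecutive failures.
            self.streak = 0 if ok else self.streak + 1
            if self.streak >= self.fall:
                self.up, self.streak = False, 0
        else:
            # While down, streak counts consecutive successes.
            self.streak = self.streak + 1 if ok else 0
            if self.streak >= self.rise:
                self.up, self.streak = True, 0
        return self.up


if __name__ == "__main__":
    optimistic = ServerHealth(start_up=True)
    pessimistic = ServerHealth(start_up=False)
    print("before first check:", optimistic.up, pessimistic.up)
    print("after a failed check:", optimistic.record_check(False),
          pessimistic.record_check(False))
```

In this model the pessimistic server takes no traffic until a check has
actually passed, whereas the optimistic one takes traffic right up until
its first failed check.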

It's also worth pointing out that, whilst the "persist haproxy state over
soft restarts" concept that's been mentioned previously on the list would
solve this for orderly restarts, it wouldn't solve it for crashes, reboots
or other unplanned restarts. I think the option I mentioned above would be
one way to solve it nicely, for multiple use cases.

[ For a *not* nice solution, I'll post a follow up when I get my
stick-table concept going. It's /nasty/. IMHO. Don't make me put it into
production! ;-) ]

Cheers,
Jonathan
