Hi all -

[ tl;dr How do you stop haproxy using failed backend servers immediately after reload? Haproxy devs, please consider implementing a consider-servers-initially-DOWN option! ]

I wonder if people could outline how they're dealing with the combination of these two haproxy behaviours:

1) On restart/reload/disabled-server-now-enabled-via-admin-interface, haproxy considers a server to be 1 health check away from going down, but considers it *initially* up.

2) On restart/reload, haproxy spreads out each backend's(?) initial server health checks over the entire health check interval.

(If I'm slightly off with either of those statements, please forgive the inaccuracy and let it slide for the purposes of this discussion; do let me know if I'm /meaningfully/ wrong of course!)

The combination of these facts in a high traffic environment seems to imply that an unhealthy-but-just-enabled server which is listed last in an haproxy backend may receive requests for a longer-than-expected period of time, resulting in a non-trivial number of requests failing.

In such an environment, where multiple load balancers are involved and can be reloaded sequentially (such as mine!), it would be preferable to take a pessimistic approach and /not/ expose servers to traffic until you're positive that the backend is healthy, rather than haproxy's current default-optimism approach.

I've been considering some methods to deal with this, but haven't got a working config yet. It's getting somewhat convoluted and stick-table heavy, so I thought I'd ask everyone:

Where you have decided that this is something you actually need to deal with, *how* are you doing that?

(I totally recognise that the combination of a frequent health check interval and non-insane traffic volumes may mask this issue, leading many -- myself included in previous jobs! -- not to consider it a problem in the first place)

It's worth pointing out that I /believe/ this situation could be easily solved (operationally) by a global, per-backend or per-server option which switches on the pessimistic behaviour mentioned above. I recognise that this may not be easy from an /implementation/ perspective, of course.

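
[ To put rough numbers on the exposure window described above, here is a back-of-envelope sketch in Python. It is /not/ how haproxy schedules checks internally; it only encodes the two behaviours as stated in this mail, plus some assumptions of its own: initial checks are spread evenly across one check interval in listing order, traffic is balanced round-robin across all servers, and a single failed check is enough to mark a server down. The function names and the example figures (10 servers, 5s interval, 2000 req/s) are made up for illustration. ]

```python
def first_check_offsets(num_servers, inter_s):
    """Seconds after reload at which each server gets its first check.

    Assumption: checks are spread evenly over one interval, in listing order.
    """
    return [i * inter_s / num_servers for i in range(num_servers)]


def requests_sent_before_first_check(position, num_servers, inter_s, total_rps):
    """Requests a server receives before its first health check runs.

    Assumption: round-robin balancing, with every server treated as up
    until its own first check has completed.
    """
    offset = first_check_offsets(num_servers, inter_s)[position]
    return offset * total_rps / num_servers


if __name__ == "__main__":
    # Hypothetical figures, chosen only to illustrate the effect.
    n, inter, rps = 10, 5.0, 2000.0
    for pos in (0, n - 1):
        offset = first_check_offsets(n, inter)[pos]
        exposed = requests_sent_before_first_check(pos, n, inter, rps)
        print(f"server {pos}: first check at +{offset:.1f}s, "
              f"~{exposed:.0f} requests before it")
```

Under those assumptions the last-listed server is not checked until 4.5s after the reload and is handed about 900 requests in the meantime; if it is actually unhealthy, those are the requests at risk on every reload of every load balancer.
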
[Willy: any chance of an option to start each server as if it were down, but being 1 check away from going up, rather than the opposite? :-)]

It's also worth pointing out that, whilst the "persist haproxy state over soft restarts" concept that's been mentioned previously on the list would solve this for orderly restarts, it wouldn't solve it for crashes, reboots or other unplanned restarts. I think the option I mentioned above would be one way to solve it nicely, for multiple use cases.

[ For a *not* nice solution, I'll post a follow-up when I get my stick-table concept going. It's /nasty/. IMHO. Don't make me put it into production! ;-) ]

Cheers,
Jonathan

