On Tue, Jan 21, 2014 at 09:04:12PM -0500, Patrick Hemmer wrote:
> Personally I would not like that every server is considered down until
> after the health checks pass. Basically this would result in things
> being down after a reload, which defeats the point of the reload being
> non-interruptive.
I can confirm, we had this in a very early version, something like 1.0.x,
and it was quickly changed! I've been using Alteon load balancers for years
and their health checks are slow. I remember that the people in charge of
them were always scared to reboot them because the services remained down
for a long time after a reboot (seconds to minutes). So we definitely don't
want this to happen here.

> I can think of 2 possible solutions:
> 1) When the new process comes up, do an initial check on all servers
> (just one) which have checks enabled. Use that one check as the verdict
> for whether each server should be marked 'up' or 'down'.

That's exactly what is currently done. The servers are marked "almost dead",
so the first check gives the verdict. Initially we had all checks started
immediately, but it caused a lot of issues at several places where there was
a high number of backends or servers mapped to the same hardware, because
the rush of connections really caused the servers to be flagged as down. So
we started to spread the checks over the longest check period in a farm.

> After each
> server has been checked once, then signal the other process to shut down
> and start listening.

Unfortunately that is not really possible, because we have to bind before
the fork (before losing privileges), and the poll loop cannot be used before
the fork.

> 2) Use the stats socket (if enabled) to pull the stats from the previous
> process. Use its health check data to pre-populate the health data of
> the new process. This one has a few drawbacks though. The server &
> backend names must match between the old and new config, and the stats
> socket has to be enabled. It would probably be harder to code as well,
> but I really don't know on that.

There was an old thread many years ago on this list where a somewhat similar
solution was proposed; it was quite simple but nobody worked on it. The idea
was to dump the server states from the shutdown script to a file upon
reload, and to pass that file to the new process so that it could parse it
and find the relevant information there. I must say I liked the principle,
because it could also be used as a configuration trick to force certain
servers' states at boot without touching the configuration file, for
example. I think it can easily be done for basic purposes.

The issue is always with adding/removing/renaming servers. Right now the
"official" server identifier is its numeric ID, which can be forced (useful
for APIs and SNMP) or automatically assigned. Peers use these IDs for state
table synchronization, for example. Ideally, upon a reload, we should
consider that IDs are used if they're forced, otherwise names are used. That
would cover only addition/removal when IDs are not set, and renaming as well
when IDs are set. And this works for frontends and backends as well.
Currently we don't have the information saying that an ID was manually
assigned, but it is a very minor detail to add!

Willy
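
For illustration, here is a minimal Python sketch of the "dump the states to
a file upon reload" step discussed above. The socket and file paths are
assumptions (they depend on the local configuration), and it only relies on
the existing "show stat" command of the stats socket; the interesting part,
re-applying the saved states inside the new process, would still have to be
done by haproxy itself.

```python
import csv
import socket

# Assumed paths -- adjust to the local setup.
STATS_SOCKET = "/var/run/haproxy.sock"
STATE_FILE = "/var/run/haproxy.state"


def dump_server_states(sock_path=STATS_SOCKET, out_path=STATE_FILE):
    """Ask the running process for "show stat" over the stats socket and
    write one line per server: "<backend> <server> <status>"."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.connect(sock_path)
        sock.sendall(b"show stat\n")
        raw = b""
        while True:
            chunk = sock.recv(4096)
            if not chunk:
                break
            raw += chunk

    # The CSV output starts with "# pxname,svname,...": strip the marker.
    rows = csv.reader(raw.decode().lstrip("# ").splitlines())
    header = next(rows)
    px, sv, st = header.index("pxname"), header.index("svname"), header.index("status")

    with open(out_path, "w") as out:
        for row in rows:
            # Skip the per-proxy summary rows, keep real servers only.
            if row and row[sv] not in ("FRONTEND", "BACKEND"):
                out.write(f"{row[px]} {row[sv]} {row[st]}\n")


if __name__ == "__main__":
    dump_server_states()
```

The shutdown/reload script would run this against the old process just
before starting the new one, and hand the resulting file to the new process.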

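Along the same lines, a rough sketch of the matching rule from the last
paragraph (forced IDs win, otherwise names), again in Python. The entry and
field names are made up for the example, and the "forced" flag is precisely
the information haproxy does not record yet.

```python
def find_saved_state(saved, server_name, forced_id=None):
    """Look up the saved entry for one server of the new configuration.

    saved:      list of dicts such as {"id": 1001, "forced": True,
                "name": "web1", "status": "UP"} built from the state file.
    forced_id:  the server's numeric ID if it was set by hand in the new
                config, else None (assumes the "forced" flag is available).
    """
    for entry in saved:
        if forced_id is not None:
            # ID was forced: trust it, so a renamed server keeps its state.
            if entry.get("forced") and entry["id"] == forced_id:
                return entry
        elif entry["name"] == server_name:
            # No forced ID: fall back to the name. Added/removed servers
            # simply find no entry (or leave an unused one behind).
            return entry
    return None  # new server, or renamed without a forced ID: start fresh
```

A server that finds no entry in the file would then just start in the usual
"almost dead" state and wait for its first check, exactly as today.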
