On Tue, Jan 21, 2014 at 09:04:12PM -0500, Patrick Hemmer wrote:
> Personally I would not like every server to be considered down until
> the health checks pass. Basically this would result in things being
> down after a reload, which defeats the point of the reload being
> non-interruptive.

I can confirm: we had this in a very early version, something like 1.0.x,
and it was quickly changed! I've been using Alteon load balancers for
years and their health checks are slow. I remember that the people in
charge of them were always scared to reboot them because the services
remained down for a long time after a reboot (seconds to minutes). So
we definitely don't want this to happen here.

> I can think of 2 possible solutions:
> 1) When the new process comes up, do an initial check on all servers
> (just one) which have checks enabled. Use that one check as the verdict
> for whether each server should be marked 'up' or 'down'.

That's exactly what is currently done. The servers are marked
"almost dead", so the first check gives the verdict. Initially we
started all checks immediately, but it caused a lot of issues at several
places with a high number of backends or servers mapped to the same
hardware, because the rush of connections really caused the servers to
be flagged as down. So we started to spread the checks over the longest
check period in the farm.
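
Just to illustrate the spreading principle (this is only a Python sketch
with made-up server names and "inter" values, not what the C code does):

  # Stagger each server's first health check across the longest check
  # interval found in the farm, so a freshly started process does not
  # hit all servers at once.
  def initial_check_delays(servers):
      """servers: list of (name, inter_ms); returns {name: first_check_delay_ms}."""
      if not servers:
          return {}
      longest_inter = max(inter for _, inter in servers)
      step = longest_inter / len(servers)
      return {name: int(i * step) for i, (name, _) in enumerate(servers)}

  # Example: three servers with 2s/2s/5s check intervals
  print(initial_check_delays([("web1", 2000), ("web2", 2000), ("web3", 5000)]))
  # -> {'web1': 0, 'web2': 1666, 'web3': 3333}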

> After each
> server has been checked once, then signal the other process to shut down
> and start listening.

It is not really possible unfortunately, because we have to bind before
the fork (before dropping privileges), and the polling loop cannot be
used before the fork.

> 2) Use the stats socket (if enabled) to pull the stats from the previous
> process. Use its health check data to pre-populate the health data of
> the new process. This one has a few drawbacks though. The server &
> backend names must match between the old and new config, and the stats
> socket has to be enabled. It would probably be harder to code as well,
> but I really don't know about that.

There was an old thread many years ago on this list where a somewhat
similar solution was proposed; it was quite simple but nobody worked
on it. The idea was to dump the servers' states from the shutdown script
to a file upon reload, and to pass that file to the new process so that
it could parse it and find the relevant information there.

I must say I liked the principle, because it could also be used as a
configuration trick to force certain servers' states at boot without
touching the configuration file, for example.
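
To give an idea of what such a dump script could do, here is a rough
Python sketch that pulls the existing "show stat" CSV output from the old
process's stats socket and writes one line per server. The socket path,
the output file and its one-line-per-server format are just made up for
the example, nothing of this exists today:

  import csv
  import socket

  SOCK_PATH = "/var/run/haproxy.sock"    # hypothetical stats socket path
  STATE_FILE = "/var/run/haproxy.state"  # hypothetical dump file

  def dump_server_states():
      # fetch the whole "show stat" CSV from the old process
      with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as s:
          s.connect(SOCK_PATH)
          s.sendall(b"show stat\n")
          data = b""
          while chunk := s.recv(4096):
              data += chunk

      # first line is the header prefixed with "# "
      lines = data.decode().lstrip("# ").splitlines()
      rows = csv.DictReader(lines)
      with open(STATE_FILE, "w") as f:
          for row in rows:
              # skip the per-proxy FRONTEND/BACKEND aggregate lines
              if row["svname"] in ("FRONTEND", "BACKEND"):
                  continue
              f.write(f"{row['pxname']}/{row['svname']} {row['status']}\n")

  if __name__ == "__main__":
      dump_server_states()

The new process would then only have to parse that file at startup and
apply the recorded states.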

I think it can easily be done for basic purposes. The issue is always
with adding/removing/renaming servers.

Right now the "official" server identifier is its numeric ID, which can
be forced (useful for APIs and SNMP) or automatically assigned. Peers
use these IDs for state table synchronization, for example. Ideally,
upon a reload, we should match on IDs when they're forced, and on names
otherwise. When IDs are not set, that would cover only addition/removal;
when IDs are set, it would cover renaming as well. And this works for
frontends and backends as well. Currently we don't have the information
saying that an ID was manually assigned, but it is a very minor detail
to add!
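
To make that matching rule concrete, here is a tiny Python sketch of it;
the Server structure and its fields are made up for the example, they are
not HAProxy internals:

  from collections import namedtuple

  # illustrative structure only: name, numeric ID, whether the ID was forced
  Server = namedtuple("Server", "name puid id_forced")

  def find_old_counterpart(new_srv, old_servers):
      """Return the old server whose state the new one should inherit, or None."""
      if new_srv.id_forced:
          # a forced ID survives a rename of the server
          return next((o for o in old_servers
                       if o.id_forced and o.puid == new_srv.puid), None)
      # without a forced ID, the name survives additions/removals of other servers
      return next((o for o in old_servers
                   if not o.id_forced and o.name == new_srv.name), None)

  old = [Server("web1", 1, False), Server("db-old-name", 42, True)]
  print(find_old_counterpart(Server("web1", 3, False), old))         # matched by name
  print(find_old_counterpart(Server("db-new-name", 42, True), old))  # matched by forced ID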

Willy

