Re: Health check hell

Simon Horman Mon, 02 Dec 2013 03:57:56 -0800

On Thu, Nov 28, 2013 at 03:41:15PM +0100, Willy Tarreau wrote:
> Hi guys,
> 
> I'm CCing the persons who've been most involved in the evolutions of the
> health check system and who might have strong opinions about what to take
> care of.
> 
> The recent inclusion of the agent-check has unveiled how much the current
> check subsystem is a complex mess full of corner cases. Igor sent me some
> screen captures of abnormal stats pages with servers marked DRAIN in brown
> while they were just set to weight 0 in the config, etc... The reason is
> the ambiguity we have in defining states because we have added more and
> more exceptions and some combinations are not properly documented. Thus
> I'm proposing to perform some changes to remain compatible with what we
> had till now and ensure that both the agent and the CLI work in a coherent
> and understandable way.
> 
> First, brief summary of the current situation :
> 
>   - MAINT state has the highest precedence. It can be enforced from the CLI.
>     A server can be turned from any state to MAINT and the stats page will
>     report MAINT. Checks are automatically disabled in this state. This
>     state may be inherited from tracked servers. The stats page reports such
>     servers in brown whatever their previous check results. Technically,
>     this state is represented as a flag on the server which is checked before
>     everything else.
> 
>   - DRAIN is the state where either the user (via the config or CLI) or the
>     agent explicitly forces the server weight to zero. It has the second
>     highest precedence since it can be enforced from the CLI and is
>     persistent.
>     This state should appear only on servers which are technically UP, so they
>     can still receive some traffic. In practice we don't need to "store" the
>     DRAIN state, a server should be considered in this state when it's UP (or
>     unchecked) and its weight is zero. It's important to keep a special color
>     for this case (currently blue) on the stats page for this. Writing "DRAIN"
>     instead of "UP" also helps spotting it.
> 
>   - UNCHECKED is the state where the server is enabled and never performs
>     any health checks. It's not in DRAIN state either when reported in this
>     state. That's the gray state on the stats page.
> 
>   - NOLB cannot be forced from the CLI nor the agent. It's equivalent to a
>     DRAIN mode except that it is deduced from the result of a health check
>     ("404") and does not affect the weight. It is maintained until the check
>     reports a different state, or until the server goes down, where it
>     automatically clears. It may be inherited from tracked servers. It is only
>     used in HTTP mode with "http-check disable-on-404" at the moment.
> 
>   - UP is the state where the server is consistently seen as OK without any of
>     the exceptions above. This state is altered by health checks. The agent
>     might switch away from it, until a new check changes this. The CLI must
>     provide the ability to do the same. The agent can currently force the
>     server to be seen up by emitting a weighted percentage.
> 
>   - UP/GOINGDOWN is the state where the server was previously seen as OK but
>     recently failed less than "fall" checks. It's without any of the
>     exceptions above.
> 
>   - DOWN is the state where the server is consistently seen as KO without any
>     of the exceptions above. The agent must be able to temporarily force the
>     server into this state until next health check might change it again. The
>     CLI must provide the ability to do the same.
> 
>   - DOWN/GOINGUP is the state where the server was previously seen as KO but
>     recently succeeded less than "rise" checks. It's without any of the
>     exceptions above.
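
To check my reading of the precedence rules summarised above, here is a
rough Python sketch of how the reported state could be derived. It is not
HAProxy code: the `Server` fields and `reported_state()` are hypothetical
names, and treating a weight-0 server in UP/GOINGDOWN as DRAIN is my own
assumption.

```python
from dataclasses import dataclass


@dataclass
class Server:
    # All field names are hypothetical, chosen only for this sketch.
    maint: bool = False          # MAINT flag, from the CLI or inherited via tracking
    checked: bool = True         # False when the server performs no health checks
    up: bool = True              # consolidated check verdict (ignored if unchecked)
    in_transition: bool = False  # some recent checks disagree with `up`
    nolb: bool = False           # set by a "404" with "http-check disable-on-404"
    weight: int = 1


def reported_state(s: Server) -> str:
    """Return the state label, testing conditions by descending precedence."""
    if s.maint:
        return "MAINT"
    running = s.up or not s.checked
    if running and s.weight == 0:
        return "DRAIN"  # derived from UP (or unchecked) + weight 0, not stored
    if not s.checked:
        return "UNCHECKED"
    if not s.up:
        # NOLB clears automatically once the server goes down.
        return "DOWN/GOINGUP" if s.in_transition else "DOWN"
    if s.nolb:
        return "NOLB"
    return "UP/GOINGDOWN" if s.in_transition else "UP"
```

With this model, `Server(maint=True, weight=0)` is reported as MAINT and
`Server(checked=False, weight=0)` as DRAIN, which matches the summary as I
understand it.
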
> 
> Right now MAINT state is propagated from tracked servers, NOLB is propagated
> as well, but not DRAIN. Changing a server's weight does not affect the
> tracking servers' weight, and it definitely must not.
> 
> At the moment, only checked servers may be tracked, but since we can now
> enable/disable a server, it would make sense to allow a server to track
> unchecked servers as well so that a single "enable" or "disable" applies
> to the whole list of trackers.
> 
> Right now what is propagated across tracked servers is :
>   - MAINT
>   - NOLB
>   - UP/DOWN and DOWN/UP transitions
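
A small Python sketch of the propagation listed above, as I understand it.
The `Srv` class and `effective()` are hypothetical names for illustration
only; following a chain of trackers and OR-ing MAINT along it is an
assumption on my part.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Srv:
    # Hypothetical model: only the attributes discussed in this thread.
    name: str
    weight: int = 1
    maint: bool = False
    nolb: bool = False
    up: bool = True
    track: Optional["Srv"] = None  # server whose state this one follows


def effective(s: Srv) -> Tuple[bool, bool, bool, int]:
    """Return (maint, nolb, up, weight) as seen for server `s`."""
    maint = s.maint
    src = s
    while src.track is not None:   # walk to the server that is really checked
        src = src.track
        maint = maint or src.maint  # MAINT is inherited from tracked servers
    # NOLB and UP/DOWN come from the tracked server; weight stays per-server.
    return (maint, src.nolb, src.up, s.weight)
```

For example, with `a = Srv("a", weight=10)` and `b = Srv("b", weight=5,
track=a)`, setting `a.weight = 0` leaves `effective(b)` at weight 5, while
setting `a.up = False` makes `b` appear down as well.
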
> 
> We should consider that the agent provides exactly the same capabilities as
> the CLI, because it is used to alter the server's behaviour beyond what the
> config plans, exactly as the CLI does. This means several things :
> 
>   - weights are per-server, so a weight change learned from an agent is not
>     propagated to tracking servers.
> 
>   - CLI needs the ability to set a server up or down just like the agent. This
>     is currently not possible.
> 
>   - CLI's set weight does not turn the server up while agent's weight turns it
>     on, I think we need to align the agent on the CLI here.
> 
>   - we'll later have to add a new directive "agent-track", comparable to
>     "track", to propagate agent changes to tracking servers.
> 
>   - CLI always has the final word because from the CLI we can disable the
>     agent.
> 
> The "DRAIN" state is very similar to the NOLB state except that it explicitly
> forces the weight to zero, causing the loss of the previous weight.
> 
> So probably we should change a few things in the agent :
>   - have the weight announces not change the server's state, just the weight,
>     just like the CLI. This is useful to announce the server's load only
>     without interfering with checks ;
> 
>   - have DRAIN and NOLB be exactly the same thing. That means that an agent
>     responds DRAIN when it just wants the server not to receive new
>     connections regardless of its operating state. This state will be
>     ignored when the server is already down, and DOWN will follow.


It's unclear to me what the difference would be between DRAIN/NOLB
and setting the weight to 0. Is the difference that the weight would
be retained?

It is also not clear to me at what point DOWN would follow.
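
To make the first question concrete, here is a rough Python sketch of the
two readings. Everything in it is hypothetical: `no_new_conns` is a made-up
flag standing in for a merged NOLB/DRAIN state, and nothing here reflects
the actual code.

```python
class Server:
    """Hypothetical server model, just enough to compare the two options."""

    def __init__(self, weight: int) -> None:
        self.weight = weight
        self.no_new_conns = False  # made-up flag for a merged NOLB/DRAIN state

    def accepts_new_connections(self) -> bool:
        return self.weight > 0 and not self.no_new_conns


def drain_by_weight(srv: Server) -> None:
    # Option 1: force the weight to zero; the previous weight is lost.
    srv.weight = 0


def drain_by_flag(srv: Server) -> None:
    # Option 2: set a separate flag; the configured weight is retained.
    srv.no_new_conns = True


def undrain_by_flag(srv: Server) -> None:
    srv.no_new_conns = False
```

With option 2, a server created with weight 50 still has weight 50 after
`drain_by_flag()` and `undrain_by_flag()`. With option 1, someone has to
remember the 50 and set it again.
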

>   - support an "up" command to immediately turn the server up and reverse the
>     effects of "down", allowing it to run without health checks and just the
>     agent.

It is not clear to me how "up" would work in a situation where a server
had been set to NOLB/DRAIN but was not yet DOWN. This might be
because I don't understand how the transition from NOLB/DRAIN to DOWN
would occur.

> 
> Then these changes will follow for the CLI :
> 
>   - the CLI must gain support for setting the NOLB/DRAIN state.
> 
>   - the CLI must also support "set server xxx up/down".
> 
> We'd report in blue on the stats page servers that are either in NOLB state or
> that have a weight set to zero, as it has been done till now.
> 
> What do you think ? I'm willing to perform the changes but I want to be sure
> that it will match what users expect, especially for the agent string format.
> 
> Thanks,
> Willy
>
