On Thu, Nov 28, 2013 at 03:41:15PM +0100, Willy Tarreau wrote:
> Hi guys,
>
> I'm CCing the persons who've been most involved in the evolutions of the
> health check system and who might have strong opinions about what to take
> care of.
>
> The recent inclusion of the agent-check has unveiled how much the current
> check subsystem is a complex mess full of corner cases. Igor sent me some
> screen captures of abnormal stats pages with servers marked DRAIN in brown
> while they were just set to weight 0 in the config, etc... The reason is
> the ambiguity we have in defining states because we have added more and
> more exceptions and some combinations are not properly documented. Thus
> I'm proposing to perform some changes to remain compatible with what we
> had till now and ensure that both the agent and the CLI work in a coherent
> and understandable way.
>
> First, brief summary of the current situation :
>
> - MAINT state has the highest precedence. It can be enforced from the CLI.
> A server can be turned from any state to MAINT and the stats page will
> report MAINT. Checks are automatically disabled in this state. This
> state may be inherited from tracked servers. The stats page reports such
> servers in brown whatever their previous check results. Technically,
> this state is represented as a flag on the server which is checked before
> everything else.
>
> - DRAIN is the state where either the user (via the config or CLI) or the
> agent explicitly forces the server weight to zero. It has the second
> highest precedence since it can be enforced from the CLI and is persistent.
> This state should appear only on servers which are technically UP, so they
> can still receive some traffic. In practice we don't need to "store" the
> DRAIN state, a server should be considered in this state when it's UP (or
> unchecked) and its weight is zero. It's important to keep a special color
> for this case (currently blue) on the stats page for this. Writing "DRAIN"
> instead of "UP" also helps spotting it.
>
> - UNCHECKED is the state where the server is enabled and never performs
> any health checks. It's not in DRAIN state either when reported in this
> state. That's the gray state on the stats page.
>
> - NOLB cannot be forced from the CLI nor the agent. It's equivalent to a
> DRAIN mode except that it is deduced from the result of a health check
> ("404") and does not affect the weight. It is maintained until the check
> reports a different state, or until the server goes down, where it
> automatically clears. It may be inherited from tracked servers. It is only
> used in HTTP mode with "http-check disable-on-404" at the moment.
>
> - UP is the state where the server is consistently seen as OK without any of
> the exceptions above. This state is altered by health checks. The agent
> might switch away from it, until a new check changes this. The CLI must
> provide the ability to do the same. The agent can currently force the
> server to be seen up by emitting a weighted percentage.
>
> - UP/GOINGDOWN is the state where the server was previously seen as OK but
> recently failed less than "fall" checks. It's without any of the
> exceptions above.
>
> - DOWN is the state where the server is consistently seen as KO without any
> of the exceptions above. The agent must be able to temporarily force the
> server into this state until next health check might change it again. The
> CLI must provide the ability to do the same.
>
> - DOWN/GOINGUP is the state where the server was previously seen as KO but
> recently succeeded less than "rise" checks. It's without any of the
> exceptions above.
>
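
The precedence described in the summary above can be sketched roughly as
follows. This is only a minimal illustration in Python, not HAProxy code:
the Server class, its field names and the "pending" counter are all
hypothetical, and the ordering is my reading of the list above.

```python
from dataclasses import dataclass


@dataclass
class Server:
    # All fields are hypothetical names used for illustration only.
    maint: bool = False    # MAINT flag, checked before everything else
    checked: bool = True   # False means the server performs no health checks
    weight: int = 1        # a weight of zero means the server is drained
    nolb: bool = False     # set by a "404" check with disable-on-404
    up: bool = True        # last consolidated health check verdict
    pending: int = 0       # recent check results contradicting `up`


def reported_state(s: Server) -> str:
    """Return the state the stats page would report, by precedence."""
    if s.maint:
        return "MAINT"
    if not s.checked:
        # An unchecked server is DRAIN only when its weight is zero.
        return "DRAIN" if s.weight == 0 else "UNCHECKED"
    if not s.up:
        # DRAIN and NOLB only apply to servers that are technically UP.
        return "DOWN/GOINGUP" if s.pending else "DOWN"
    if s.weight == 0:
        return "DRAIN"
    if s.nolb:
        return "NOLB"
    return "UP/GOINGDOWN" if s.pending else "UP"
```

With this reading, a server with weight 0 that is DOWN is reported as DOWN,
not DRAIN, and MAINT hides everything else.
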
> Right now MAINT state is propagated from tracked servers, NOLB is propagated
> as well, but not DRAIN. Changing a server's weight does not affect the
> tracking servers' weight, and it definitely must not.
>
> At the moment, only checked servers may be tracked, but since we can now
> enable/disable a server, it would make sense to allow a server to track
> unchecked servers as well so that a single "enable" or "disable" applies
> to the whole list of trackers.
>
> Right now what is propagated across tracked servers is :
> - MAINT
> - NOLB
> - UP/DOWN and DOWN/UP transitions
>
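
As a rough illustration of the propagation rule listed above (MAINT, NOLB
and UP/DOWN transitions are inherited by trackers, weights are not), here is
a small Python sketch. The Srv class and the propagate() helper are
hypothetical names invented for this example, not anything in HAProxy.

```python
from dataclasses import dataclass, field

# Attributes that tracking servers inherit; the weight is deliberately absent.
INHERITED = ("maint", "nolb", "up")


@dataclass
class Srv:
    name: str
    weight: int = 1
    maint: bool = False
    nolb: bool = False
    up: bool = True
    trackers: list = field(default_factory=list)  # servers tracking this one


def propagate(src: Srv, **changes) -> None:
    """Apply changes to src and forward only the inherited ones to trackers."""
    for key, value in changes.items():
        setattr(src, key, value)
    inherited = {k: v for k, v in changes.items() if k in INHERITED}
    if inherited:
        for tracker in src.trackers:
            propagate(tracker, **inherited)


# Usage: "b" tracks "a". Putting "a" in MAINT with weight 0 moves "b" to
# MAINT as well, but "b" keeps its own weight.
a = Srv("a")
b = Srv("b", weight=5)
a.trackers.append(b)
propagate(a, weight=0, maint=True)
```
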
> We should consider that the agent provides exactly the same capabilities as
> the CLI, because it is used to alter the server's behaviour beyond what the
> config plans, exactly as the CLI does. This means several things :
>
> - weights are per-server, so a weight change learned from an agent is not
> propagated to tracking servers.
>
> - CLI needs the ability to set a server up or down just like the agent. This
> is currently not possible.
>
> - CLI's set weight does not turn the server up while agent's weight turns it
> on, I think we need to align the agent on the CLI here.
>
> - we'll later have to add a new directive "agent-track", comparable to
> "track" to propagate agent changes to tracking servers.
>
> - CLI always has the final word because from the CLI we can disable the
> agent.
>
> The "DRAIN" state is very similar to the NOLB state except that it explicitly
> forces the weight to zero, causing the loss of the previous weight.
>
> So probably we should change a few things in the agent :
> - have the weight announces not change the server's state, just the weight,
> just like the CLI. This is useful to announce the server's load only
> without interfering with checks ;
>
> - have DRAIN and NOLB be exactly the same thing. That means that an agent
> responds DRAIN when it just wants the server not to receive new
> connections regardless of its operating state. This state will be ignored
> when the server is already down, and DOWN will follow.

It's unclear to me what the difference would be between DRAIN/NOLB
and setting the weight to 0. Is the difference that the weight would
be retained?

It is also not clear to me at what point DOWN would follow.

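
To make the question about the weight concrete, here is a minimal Python
sketch of the two readings I can think of. It is an assumption on my side,
not a description of the current code: the Server class, the "drain" flag
and effective_weight() are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Server:
    weight: int = 10      # configured or last announced weight
    drain: bool = False   # hypothetical flag, stored apart from the weight

    def effective_weight(self) -> int:
        """Weight seen by the load balancer: zero while draining."""
        return 0 if self.drain else self.weight


# Reading 1: draining is done by overwriting the weight with zero.
# The previous weight (10) is lost and must be set again by hand.
a = Server()
a.weight = 0

# Reading 2: draining sets a separate flag. Clearing the flag brings
# the previous weight back without anyone having to remember it.
b = Server()
b.drain = True
weight_while_draining = b.effective_weight()
b.drain = False
```

If the second reading is the intended one, then DRAIN/NOLB and "weight 0"
would differ only in whether the previous weight survives.
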
> - support an "up" command to immediately turn the server up and reverse the
> effects of "down", allowing it to run without health checks and just the
> agent.

It is not clear to me how "up" would work in a situation where a server
had been set to NOLB/DRAIN but was not yet DOWN. This might be
because I don't understand how the transition from NOLB/DRAIN to DOWN
would occur.

>
> Then these changes will follow for the CLI :
>
> - the CLI must gain support for setting the NOLB/DRAIN state.
>
> - the CLI must also support "set server xxx up/down".
>
> We'd report in blue on the stats page servers that are either in NOLB state or
> that have a weight set to zero, as it has been done till now.
>
> What do you think ? I'm willing to perform the changes but I want to be sure
> that it will match what users expect, especially for the agent string format.
>
> Thanks,
> Willy
>