Re: Health check hell

Willy Tarreau Wed, 04 Dec 2013 10:19:09 -0800

Hi Malcolm,

On Wed, Dec 04, 2013 at 03:05:41PM +0000, Malcolm Turnbull wrote:
> Hi Willy,
> 
> Sorry for the lack of response from the Loadbalancer.org end, I must
> confess we were getting a bit confused by the descriptions :-).


I'm not surprised! I got even more confused when trying to debug some
of the issues Igor reported and not understanding what would act on
what, what would be propagated from tracked servers, etc... Anyway,
writing the design limitations here and explaining them helps us
get rid of them.

> The only thing in my mind to be aware of is the design decision of the
> agent to report DOWN or DRAIN on every agent request until the agent
> starts responding with x% again..
> Was because if you send an UP response from the agent how does the
> agent know that HAProxy has read that value and acted on it? It would
> need to know when it was safe to start responding with x% again?

OK, I get your point. My point was to emit two things at once,
e.g. "UP 10%".

We could have the agent specification state that the response format
may include optional state words, optionally followed by a weight.
That way we can have agents which return state only, weight only or
both.

> Our primary requirement at Loadbalancer.org is for the first scenario
> i.e. dynamic weight adjustment and uses standard health checks:
> 
>   - inform the load balancer about the server's load to adjust the
>     weights, but not interact with the service's state which is
>     monitored using regular checks. It basically replaces the job
>     of the admin who would constantly re-adjust weights depending
>     on the servers load.

I agree that this should be by far the most common use, especially in
combination with the service check. That's the reason why I'm embarrassed
by the fact that we put the server UP when returning a percentage: it
means the agent returning the load has to be aware of the service state,
which is not logical.

> The following usage case makes sense, but isn't really a priority for us:
> 
>   - offer a complete health check system to services which are not
>     easily checkable. In this case they would simply be used without
>     a regular check. This is more a service-level approach and not
>     a server-level one.

It's not my priority either though I know some people will want it when
they already have to use an agent and need to deploy a second script to
check the health of a specific service : they won't find it convenient
to run two scripts on different ports, one for the state and one for the
load.

> The third logical function for us was:
> 
> For a Windows administrator to have a simple GUI DRAIN/HALT button in
> the agent, to enable quick local maintenance on the Windows backend
> server without having to log into the load balancer in order to set
> maintenance mode.

Hehe, just like the 404 feature in HTTP :-)

> But again this is not really a priority with us as you say it clashes
> with the CLI DRAIN logic....

It does not exactly clash, it depends how we define it. I discovered there
are 3 dimensions which are managed by a single agent while we initially
thought there were only two. The agent can :

    - declare a service's state (up or down)
    - declare an administrative state (drain/ready)
    - declare a system load (weight)

But at the moment with the language we defined, each action changes two
of them at once, which is a big problem.

And depending on what system the agent will be deployed on, not all these
features will be used together. I expect that admin state and load will be
the most common ones for an agent. Your enumeration tends to support this.

So let's try with something like this for the agent syntax :

  [keywords]* [weight]

  Where [keywords] are optional and made of :

     "up" : report that the service is UP.
     "down", "stopped", "fail" : report the service down with these causes
     "drain" : don't change the state, nor the weight, just set DRAIN mode.
     "maint" : don't change the state, nor the weight, just set MAINT mode.
     "ready" : don't change the state, nor the weight, just leave MAINT and
               DRAIN modes.

  And [weight] is optional and in the form "xxx%" to report the desired
  weight for this server relative to the configured one in the config.

Thus the following examples might illustrate it better :

   "up"        : declare the server up, don't change the configured weight
   "up 50%"    : declare the server up, set weight to 50%
   "50%"       : don't touch the server state, just set the weight to 50%
   "drain"     : don't touch the state, nor weight, just switch to drain mode.
   "maint"     : force maintenance mode.
   "drain 20%" : drain mode, adjust weight to 20% (not used in this mode but
                 will avoid complex logic in agent scripts)
   "ready 30%" : leave maint/drain modes, start at 30% weight.
   "up ready 40%" : the agent does the 3 things at once and says the service
                    is OK.
   "stopped drain 10%" : the agent does the 3 things at once and indicates
                         that the server is now down after drain mode.
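
To make the proposed syntax concrete, here is a minimal sketch in Python of
how such a response could be split into its three dimensions. This is not
HAProxy code: the names parse_agent_reply and AgentReply are hypothetical,
and lowercasing the input, rejecting unknown words and requiring the weight
to come last are assumptions of the sketch, not part of the proposal.

```python
from dataclasses import dataclass
from typing import Optional

# Keyword sets taken from the proposed syntax above.
STATE_WORDS = {"up", "down", "stopped", "fail"}
ADMIN_WORDS = {"drain", "maint", "ready"}


@dataclass
class AgentReply:
    """Hypothetical container: one optional field per dimension."""
    state: Optional[str] = None   # service state keyword, if any
    admin: Optional[str] = None   # administrative keyword, if any
    weight: Optional[int] = None  # percentage of the configured weight, if any


def parse_agent_reply(line: str) -> AgentReply:
    """Parse '[keywords]* [weight]' into an AgentReply.

    Assumptions of this sketch: input is case-insensitive, unknown
    tokens are errors, and the weight must be the last token.
    """
    reply = AgentReply()
    tokens = line.strip().lower().split()
    for i, tok in enumerate(tokens):
        if tok in STATE_WORDS:
            reply.state = tok
        elif tok in ADMIN_WORDS:
            reply.admin = tok
        elif tok.endswith("%") and tok[:-1].isdigit():
            if i != len(tokens) - 1:
                raise ValueError("weight must be the last token")
            reply.weight = int(tok[:-1])
        else:
            raise ValueError(f"unknown token: {tok!r}")
    return reply
```

With this, "50%" leaves state and admin untouched (both None), while
"stopped drain 10%" fills all three fields in one response.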

I remember we initially refrained from allowing the "maint" mode from the
agent in its first version because it was planned as a regular check and
we didn't want it to be stuck in this mode. But now that the agent runs
on its own, it makes much more sense since it will continue to be checked.

With this, we can also consider that if a regular check is configured on the
server, then the state changes are ignored from the agent. This greatly
simplifies deployments relying on a single agent for multiple services
even if this agent was initially deployed for a specific service.
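
The rule above could be sketched as follows in Python. Again this is only
an illustration under assumptions: the Server class, its field names and
apply_agent_reply are hypothetical and do not reflect HAProxy's internal
structures. The only point is that the state dimension is dropped when a
regular check exists, while the admin state and weight still apply.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Server:
    """Hypothetical model of a server entry, for illustration only."""
    configured_weight: int
    has_regular_check: bool
    state: str = "up"        # service state
    admin: str = "ready"     # administrative state
    weight_pct: int = 100    # agent-reported percentage

    @property
    def effective_weight(self) -> int:
        # The reported weight is relative to the configured one.
        return self.configured_weight * self.weight_pct // 100


def apply_agent_reply(server: Server,
                      state: Optional[str] = None,
                      admin: Optional[str] = None,
                      weight_pct: Optional[int] = None) -> None:
    """Apply each reported dimension independently.

    State changes from the agent are ignored when a regular check is
    configured, since that check already owns the service state.
    """
    if state is not None and not server.has_regular_check:
        server.state = state
    if admin is not None:
        server.admin = admin
    if weight_pct is not None:
        server.weight_pct = weight_pct
```

For example, a "down drain 50%" response sent to a server that also has a
regular check would switch it to drain mode and halve its weight, but leave
its service state to the regular check.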

We would have to improve the CLI and the stats interface to match that. We'd
change the "soft stop" in the stats interface to act on the DRAIN mode instead
of the weight. It would provide the same effect as today but in a more
consistent way.

Proceeding like this, I can easily imagine that most agents will simply
read a small file containing the admin state (maint/drain/ready) and
that others will only report the idle CPU measure.
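
As a rough idea of how small such an agent could be, here is a sketch in
Python of the part that builds the response. The function names and the
choice to pass the idle CPU figure in as a parameter are assumptions made
for the sketch (how to measure idle CPU depends on the system), and the
network side of the agent is left out entirely.

```python
from pathlib import Path
from typing import Optional

ADMIN_WORDS = {"maint", "drain", "ready"}


def read_admin_state(path: Path) -> Optional[str]:
    """Return the admin keyword stored in a small file, or None.

    A missing, unreadable or unrecognized file yields None, so the
    agent simply reports nothing for that dimension.
    """
    try:
        word = path.read_text().strip().lower()
    except OSError:
        return None
    return word if word in ADMIN_WORDS else None


def build_reply(admin: Optional[str], idle_cpu_pct: Optional[float]) -> str:
    """Compose '[keyword] [weight]' from whichever parts are available."""
    parts = []
    if admin is not None:
        parts.append(admin)
    if idle_cpu_pct is not None:
        # Clamp to 0..100 and report the idle CPU share as the weight.
        pct = max(0, min(100, round(idle_cpu_pct)))
        parts.append(f"{pct}%")
    return " ".join(parts)
```

An agent that only handles the admin state would call build_reply(admin,
None), and one that only reports load would call build_reply(None, idle).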

What do you think ?

Thanks,
Willy
