Re: [PATCH 5/5] dynamic health check

Willy Tarreau Thu, 31 Jan 2013 23:22:53 -0800

Hi Simon,

On Fri, Feb 01, 2013 at 01:56:01PM +0900, Simon Horman wrote:
> Hi Malcolm, Hi Willy,
> 
> after a bit of a hiatus I'd like to restart this discussion.

Cool, I wanted to ping you on this last weekend but forgot to do so!

> On Mon, Dec 24, 2012 at 10:23:15AM +0100, Willy Tarreau wrote:
> > Hi Malcolm,
> > 
> > On Mon, Dec 24, 2012 at 09:06:25AM +0000, Malcolm Turnbull wrote:
> > > Willy / Simon,
> > > 
> > > I'm very happy to add a down option, my original thought was that you
> > > would use the standard health checks as well as the dynamic agent for
> > > changing the weight.
> > 
> > That's what I thought I initially understood from our discussion a few
> > months ago but then your post of the specs last week slightly confused
> > me as I understood you needed this as a dedicated check. I think it was
> > the same for Simon.
> 
> Sorry, I think that the problem here lies in my understanding of what is
> desired.

No problem, several of us got confused.

> > > As you may for example want a specific HAproxy SMTP health check + use
> > > the dynamic weighting agent.
> > 
> > Exactly. But then we have two options :
> >   - retrieve the information from the checked port (easy for HTTP or TCP)
> >   - retrieve the information from a dedicated port => this involves a
> >     second task to do this, with its own check intervals.
> > 
> > The latter doesn't seem stupid at all, quite the opposite in fact, but
> > it will require more settings on the server line. However it comes with
> > a benefit, it is that when the agent returns "disable", checks are
> > disabled on the real port, but then we could have the agent continue to
> > be checked and later return a valid result again.
> >
> > > I'm not sure if that would cause some coding issues if the health
> > > checks say 'Down' and the agent says 50%? (I would assume haproxy
> > > health checks take priority?)
> > 
> > Status and weights are orthogonal. The real check should have precedence.
> > 
> > > Or if the agent says Down but the HAProxy health check says up?
> > 
> > I think it should be ANDed. This could help provide a first implementation
> > of multi-port checks after all.
> 
> That sounds reasonable.
> 
> > > I'm certainly happy for Down to be added as an option with a
> > > description string.
> > > Also I'm assuming that later (the dynamic agent) could easily be
> > > extended to an http style get check rather than TCP (lb-agent-chk)  if
> > > users prefer to write an HTTP server application to integrate with it
> > > (Kemp and Barracuda support this method).
> 
> On the topic of down: I think that Willy's proposal is
> entirely reasonable. However it's unclear to me if disable should also
> be supported or not.

The disable mode is very problematic: if a server accidentally returns it,
there is no way to roll back except a manual intervention on the load
balancers. Also there is a high risk that the backup LB will be forgotten
in such an operation. I have no technical worries here, just operational
ones. If we run agent checks on a dedicated port in parallel to health
checks, this is different, because we could ensure that such checks could
still be running when the server is disabled so that the agent can change
the mode again. So maybe a first version should not support disable and a
later one could support it?
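
To make the dedicated-port idea a bit more concrete, here is a rough
sketch in Python (not haproxy code) of how the two checks could interact.
The class name, the method names and the exact reply words ("disable",
"down", "NN%") are assumptions made up for illustration, as is the rule
that any later valid reply clears the disabled mode:

```python
from dataclasses import dataclass


@dataclass
class Server:
    health_up: bool = True   # last result of the regular health check
    agent_up: bool = True    # last up/down verdict reported by the agent
    disabled: bool = False   # set when the agent replies "disable"
    weight_pct: int = 100    # weight is orthogonal to the up/down status

    def on_health_check(self, up: bool) -> None:
        # Checks on the real port are suspended while the server is disabled.
        if not self.disabled:
            self.health_up = up

    def on_agent_reply(self, reply: str) -> None:
        # The agent port keeps being polled even when the server is disabled,
        # which is what allows a later reply to change the mode again.
        word = reply.strip().lower()
        if word == "disable":
            self.disabled = True
        elif word == "down":
            self.disabled = False
            self.agent_up = False
        elif word.endswith("%") and word[:-1].isdigit():
            self.disabled = False
            self.agent_up = True
            self.weight_pct = int(word[:-1])
        # Any other reply is ignored in this sketch.

    def usable(self) -> bool:
        # Health check and agent verdicts are ANDed.
        return not self.disabled and self.health_up and self.agent_up
```

The point is only that an accidental "disable" stops being a one-way
street once the agent is polled independently of the health check.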

Also, I believe that in another thread we discussed supporting a
new status (eg: STOPPED) which differs from DOWN in that it means the
service was intentionally stopped and did not crash. We can't support
this well right now (just map it to DOWN) but I think it's important
that people can design their agents for this. Similarly, a "FAIL"
status could be useful in the usual situations where a server is inoperative
due to external conditions but could appear valid. The common example is
the mail server which fails to receive e-mails because the FS is full.
Everything works except the service cannot be delivered. There is nothing
to restart, the issue can go away by itself, etc... We'd map this to DOWN
again, but I think some users may later prefer to have a dedicated status
in the agent's language. So we should probably plan it in the language in
order to avoid ugly patches here and there.
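
As an illustration only, the mapping could look like the Python sketch
below. The reply format (a status word optionally followed by a space and
a description string), the function name and the table name are
assumptions; the only thing taken from the discussion is that STOPPED and
FAIL are accepted by the language and both map to DOWN for now:

```python
# What the load balancer can act on today: every agent status below
# collapses to DOWN, but the original word is kept for later use.
AGENT_STATUS_MAP = {
    "down": "DOWN",
    "stopped": "DOWN",   # intentionally stopped, did not crash
    "fail": "DOWN",      # looks valid but cannot deliver the service
}


def parse_agent_status(reply: str):
    """Return (agent_word, effective_status, description)."""
    word, _, description = reply.strip().partition(" ")
    word = word.lower()
    if word not in AGENT_STATUS_MAP:
        raise ValueError("unknown agent status: %r" % word)
    return word, AGENT_STATUS_MAP[word], description.strip()
```

Keeping the agent's own word next to the effective status means a later
version could give STOPPED or FAIL a dedicated status without changing
what the agents send.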

> > That's what I'm commonly observing too. Even right now, there are a lot
> > of users who use httpchk for services that are not HTTP at all, but they
> > have a very simple agent responding to checks.
> > 
> > So now we have to decide what to do. I think Simon's code already provides
> > some useful features (assuming we support "down"). It should probably be
> > extended later to support combined checks.
> > 
> > In my opinion, this could be done in three steps :
> > 
> >   1) we merge Simon's work with the "option lb-agent-chk" directive which
> >      *replaces* the health check method with this one ;
> > 
> >   2) we implement "agent-port" and "agent-interval" on the server lines to
> >      automatically enable the agent to be run on another port even when a
> >      different check is running ;
> > 
> >   3) we implement "http-check agent-hdr <name>" to retrieve the agent string
> >      from an HTTP header for HTTP checks ;
> > 
> > That way we always support exactly the same syntax but can retrieve the
> > required information at different places depending on the checks. Does
> > that sound good to you ?
> 
> That sounds entirely reasonable to me.

Nice!

Best regards,
Willy
