Hello Cornelius,

On Mon, Oct 06, 2014 at 12:20:14PM +0200, Cornelius Riemenschneider wrote:
> Hello,
> 
> we use haproxy 1.5.3 quite successfully with the new agent-checks (a great 
> feature!)
> 
> However, recently we started noticing that our servers were going down with 
> »DOWN (agent)« as status and LastChk »CHECKED in 0ms«.
> 
> Our implementation does not issue the string »down«, only »drain« or a 
> percentage.
> 
> Is there any other way to get haproxy to think the agent issued a »down« 
> command?
> 
> I played a bit around, and while production is able to get haproxy into the 
> »DOWN (agent)«-state without issuing »down«, I'm not.
> 
> Do you have any hints what could go wrong?

From what I'm seeing in the code, only "down", "stopped" and "fail" can cause
that status to appear.
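
To check whether an agent ever emits one of these three words, a small
standalone test helper along the following lines may be useful. It is only a
sketch and not haproxy's parser: the function name is hypothetical, and it
assumes words are separated by spaces, tabs, commas or line endings and are
matched exactly and case-sensitively.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper: returns true if the agent reply contains one of the
 * words that, per the explanation above, can force the server down.
 * Assumption: words are delimited by space, tab, comma, CR or LF. */
static bool toy_reply_forces_down(const char *reply)
{
	char buf[256];
	char *word;

	/* work on a bounded local copy because strtok() modifies its input */
	strncpy(buf, reply, sizeof(buf) - 1);
	buf[sizeof(buf) - 1] = '\0';

	for (word = strtok(buf, " \t,\r\n"); word; word = strtok(NULL, " \t,\r\n")) {
		if (strcmp(word, "down") == 0 ||
		    strcmp(word, "stopped") == 0 ||
		    strcmp(word, "fail") == 0)
			return true;
	}
	return false;
}
```

Feeding it the replies captured from the production agent would show whether
any of them could explain the status by itself.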

However, what I'm also seeing in the code is that the stats page prints this
string whenever the agent is enabled AND the server state is down, whatever the
cause. Is it possible that you're running both an agent check and a regular
check? If so, I can imagine the situation where the main check fails and the
state is marked down, leading to this output.

I think we should improve the reporting here to avoid this confusion, because
in practice we don't know whether it's the agent or the main check which caused
the down status. This combination probably stems from the original design where
we did not expect to mix both agent and regular checks. We could maybe use the
combination of agent->health == 0 with that to ensure we only report
"DOWN(agent)" when we know it's the agent which forces the state down.

If you manage to reproduce the issue, I'd suggest testing this patch, which
I think will fix this inaccurate reporting:

diff --git a/src/dumpstats.c b/src/dumpstats.c
index ebf66ec..ad85571 100644
--- a/src/dumpstats.c
+++ b/src/dumpstats.c
@@ -3107,7 +3107,7 @@ static int stats_dump_sv_stats(struct stream_interface *si, struct proxy *px, in
                        chunk_appendf(&trash, "%s ", human_time(now.tv_sec - sv->last_change, 1));
                        chunk_appendf(&trash, "MAINT");
                }
-               else if ((ref->agent.state & CHK_ST_ENABLED) && (ref->state == SRV_ST_STOPPED)) {
+               else if ((ref->agent.state & CHK_ST_ENABLED) && (ref->state == SRV_ST_STOPPED) && ref->agent.health < ref->agent.rise) {
                        chunk_appendf(&trash, "%s ", human_time(now.tv_sec - ref->last_change, 1));
                        /* DOWN (agent) */
                        chunk_appendf(&trash, srv_hlt_st[1], "GCC: your -Werror=format-security is bogus, annoying, and hides real bugs, I don't thank you, really!");
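
For illustration, the difference between the two conditions can be modelled
outside of haproxy. The sketch below uses simplified, hypothetical types
(toy_srv, toy_agent) in place of the real server and check structures; only
the shape of the two tests is taken from the patch.

```c
#include <stdbool.h>

/* Hypothetical, simplified stand-ins for the fields the patch reads. */
enum toy_srv_state { TOY_SRV_STOPPED, TOY_SRV_RUNNING };

struct toy_agent {
	bool enabled;   /* models (agent.state & CHK_ST_ENABLED) */
	int health;     /* models agent.health */
	int rise;       /* models agent.rise */
};

struct toy_srv {
	enum toy_srv_state state;
	struct toy_agent agent;
};

/* Condition before the patch: any stopped server with an enabled agent
 * is reported as "DOWN (agent)". */
static bool toy_down_agent_old(const struct toy_srv *s)
{
	return s->agent.enabled && s->state == TOY_SRV_STOPPED;
}

/* Condition after the patch: the agent's own health must also be below
 * its rise threshold before the agent is blamed. */
static bool toy_down_agent_new(const struct toy_srv *s)
{
	return s->agent.enabled && s->state == TOY_SRV_STOPPED &&
	       s->agent.health < s->agent.rise;
}
```

With a stopped server whose agent is enabled and has health equal to rise, the
old test is true while the new one is false, which is the case where the main
check, not the agent, took the server down.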

> Btw, I noticed that when you output 0%, the server goes into drain, and with
> a percentage >0 it goes back up, but when you output »drain« the server only
> goes up on »up«, not on a percentage >0.

That's expected. A zero weight is displayed as "DRAIN" and behaves exactly
like the drain state. However, there's also an administrative DRAIN state
which you enter with the "drain" keyword without affecting the weight, so
a later non-zero percentage changes the weight but does not leave that state.
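
The two ways of ending up in "DRAIN" can be pictured with a toy model. This is
a sketch based only on the behaviour described in this thread, not on
haproxy's actual code: the struct, the function names and the reply parsing
are all hypothetical.

```c
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical model: the weight and the administrative drain flag are
 * two independent pieces of state. */
struct toy_server {
	int weight_pct;    /* weight as a percentage, set by "NN%" replies */
	bool admin_drain;  /* set by "drain", cleared by "up" */
};

/* Apply one agent reply word to the toy server. */
static void toy_apply_reply(struct toy_server *s, const char *reply)
{
	char *end;
	long pct;

	if (strcmp(reply, "drain") == 0) {
		s->admin_drain = true;      /* weight is left untouched */
		return;
	}
	if (strcmp(reply, "up") == 0) {
		s->admin_drain = false;
		return;
	}
	pct = strtol(reply, &end, 10);
	if (end != reply && strcmp(end, "%") == 0 && pct >= 0)
		s->weight_pct = (int)pct;   /* admin_drain is left untouched */
}

/* The status shows "DRAIN" for either cause. */
static bool toy_shown_as_drain(const struct toy_server *s)
{
	return s->admin_drain || s->weight_pct == 0;
}
```

In this model, "0%" followed by "50%" leaves DRAIN, while "drain" followed by
"50%" only changes the weight and the server stays in DRAIN until "up".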

> Could you maybe mention that in the documentation?

Do you think it would be enough to add to the doc that the stats page
also reports a weight of 0 as "DRAIN"?

Thanks for your feedback!
Willy

