Hi, On Tue, Dec 29, 2015 at 10:23:35AM +0100, [email protected] wrote: > Hi, > > I was able to reproduce the health check bug and create a small patch. > > Before patch: > - Health check for server servers/server1 succeeded, reason: External > check passed, code: 0, check duration: 6ms, status: 1/3 DOWN. > - Health check for server servers/server1 succeeded, reason: External > check passed, code: 0, check duration: 4ms, status: 2/3 DOWN. > - Health check for server servers/server1 failed, reason: External check > error, code: 1, check duration: 4ms, status: 2/3 DOWN. > - Health check for server servers/server1 succeeded, reason: External > check passed, code: 0, check duration: 3ms, status: 1/1 UP. > > After patch: > - Health check for server servers/server1 succeeded, reason: External > check passed, code: 0, check duration: 3ms, status: 1/3 DOWN. > - Health check for server servers/server1 succeeded, reason: External > check passed, code: 0, check duration: 4ms, status: 2/3 DOWN. > - Health check for server servers/server1 failed, reason: External check > error, code: 1, check duration: 4ms, status: 0/3 DOWN. > - Health check for server servers/server1 succeeded, reason: External > check passed, code: 0, check duration: 4ms, status: 1/3 DOWN. > - Health check for server servers/server1 succeeded, reason: External > check passed, code: 0, check duration: 5ms, status: 2/3 DOWN. > - Health check for server servers/server1 succeeded, reason: External > check passed, code: 0, check duration: 4ms, status: 1/1 UP. > > here's the patch: > --- src/checks.c 2015-12-29 09:53:50.971741769 +0100 > +++ checks.c 2015-12-29 09:54:19.211743665 +0100 > @@ -272,11 +272,12 @@ > * cause the server to be marked down. > */ > if ((!(check->state & CHK_ST_AGENT) || > - (check->status >= HCHK_STATUS_L57DATA)) && > - (check->health >= check->rise)) { > - s->counters.failed_checks++; > - report = 1; > - check->health--; > + (check->status >= HCHK_STATUS_L57DATA))) { > + if (check->health >= check->rise) { > + s->counters.failed_checks++; > + report = 1; > + check->health--; > + } > if (check->health < check->rise) > check->health = 0; > }
That's interesting. I missed this bug report. Do you know for how long it has been existing ? I suspect we introduced it in 1.5 when refactoring the code to include the agent check. Have you checked if we don't have the same bug in the other direction, when the check succeeds after a failure when the server is still up ? This code tends to be a bit symmetrical which is why I'm suspecting the same issue. Thanks, Willy

