Hi,

On Tue, Dec 29, 2015 at 10:23:35AM +0100, [email protected] wrote:
> Hi,
> 
> I was able to reproduce the health check bug and create a small patch.
> 
> Before patch:
> - Health check for server servers/server1 succeeded, reason: External 
> check passed, code: 0, check duration: 6ms, status: 1/3 DOWN.
> - Health check for server servers/server1 succeeded, reason: External 
> check passed, code: 0, check duration: 4ms, status: 2/3 DOWN.
> - Health check for server servers/server1 failed, reason: External check 
> error, code: 1, check duration: 4ms, status: 2/3 DOWN.
> - Health check for server servers/server1 succeeded, reason: External 
> check passed, code: 0, check duration: 3ms, status: 1/1 UP.
> 
> After patch:
> - Health check for server servers/server1 succeeded, reason: External 
> check passed, code: 0, check duration: 3ms, status: 1/3 DOWN.
> - Health check for server servers/server1 succeeded, reason: External 
> check passed, code: 0, check duration: 4ms, status: 2/3 DOWN.
> - Health check for server servers/server1 failed, reason: External check 
> error, code: 1, check duration: 4ms, status: 0/3 DOWN.
> - Health check for server servers/server1 succeeded, reason: External 
> check passed, code: 0, check duration: 4ms, status: 1/3 DOWN.
> - Health check for server servers/server1 succeeded, reason: External 
> check passed, code: 0, check duration: 5ms, status: 2/3 DOWN.
> - Health check for server servers/server1 succeeded, reason: External 
> check passed, code: 0, check duration: 4ms, status: 1/1 UP.
> 
> here's the patch:
> --- src/checks.c      2015-12-29 09:53:50.971741769 +0100
> +++ checks.c  2015-12-29 09:54:19.211743665 +0100
> @@ -272,11 +272,12 @@
>                * cause the server to be marked down.
>                */
>               if ((!(check->state & CHK_ST_AGENT) ||
> -                 (check->status >= HCHK_STATUS_L57DATA)) &&
> -                 (check->health >= check->rise)) {
> -                     s->counters.failed_checks++;
> -                     report = 1;
> -                     check->health--;
> +                 (check->status >= HCHK_STATUS_L57DATA))) {
> +                     if (check->health >= check->rise) {
> +                             s->counters.failed_checks++;
> +                             report = 1;
> +                             check->health--;
> +                     }
>                       if (check->health < check->rise)
>                               check->health = 0;
>               }

That's interesting. I missed this bug report. Do you know for how
long it has been existing ? I suspect we introduced it in 1.5 when
refactoring the code to include the agent check. Have you checked
if we don't have the same bug in the other direction, when the
check succeeds after a failure when the server is still up ? This
code tends to be a bit symmetrical which is why I'm suspecting
the same issue.

Thanks,
Willy


Reply via email to