Hi!

That's interesting. I missed this bug report. Do you know for how
long it has been existing ? I suspect we introduced it in 1.5 when
refactoring the code to include the agent check.

No, I do not know for how long this bug has been existing, sorry.

Have you checked
if we don't have the same bug in the other direction, when the
check succeeds after a failure when the server is still up ? This
code tends to be a bit symmetrical which is why I'm suspecting
the same issue.

Yes, I checked that the other way around is not introducing another bug: here's some log with rise 1 fall 5:

Health check for server servers/server1 succeeded, reason: External check passed, code: 0, check duration: 4ms, status: 5/5 UP. Health check for server servers/server1 failed, reason: External check error, code: 1, check duration: 4ms, status: 4/5 UP. Health check for server servers/server1 failed, reason: External check error, code: 1, check duration: 4ms, status: 3/5 UP. Health check for server servers/server1 failed, reason: External check error, code: 1, check duration: 4ms, status: 2/5 UP. Health check for server servers/server1 succeeded, reason: External check passed, code: 0, check duration: 4ms, status: 5/5 UP.

--
Pierre Zemb

Le 30.12.2015 07:44, Willy Tarreau a écrit :

That's interesting. I missed this bug report. Do you know for how
long it has been existing ? I suspect we introduced it in 1.5 when
refactoring the code to include the agent check. Have you checked
if we don't have the same bug in the other direction, when the
check succeeds after a failure when the server is still up ? This
code tends to be a bit symmetrical which is why I'm suspecting
the same issue.

Thanks,
Willy

On Tue, Dec 29, 2015 at 10:23:35AM +0100, [email protected] wrote:
Hi,

I was able to reproduce the health check bug and create a small patch.

Before patch:
- Health check for server servers/server1 succeeded, reason: External
check passed, code: 0, check duration: 6ms, status: 1/3 DOWN.
- Health check for server servers/server1 succeeded, reason: External
check passed, code: 0, check duration: 4ms, status: 2/3 DOWN.
- Health check for server servers/server1 failed, reason: External check
error, code: 1, check duration: 4ms, status: 2/3 DOWN.
- Health check for server servers/server1 succeeded, reason: External
check passed, code: 0, check duration: 3ms, status: 1/1 UP.

After patch:
- Health check for server servers/server1 succeeded, reason: External
check passed, code: 0, check duration: 3ms, status: 1/3 DOWN.
- Health check for server servers/server1 succeeded, reason: External
check passed, code: 0, check duration: 4ms, status: 2/3 DOWN.
- Health check for server servers/server1 failed, reason: External check
error, code: 1, check duration: 4ms, status: 0/3 DOWN.
- Health check for server servers/server1 succeeded, reason: External
check passed, code: 0, check duration: 4ms, status: 1/3 DOWN.
- Health check for server servers/server1 succeeded, reason: External
check passed, code: 0, check duration: 5ms, status: 2/3 DOWN.
- Health check for server servers/server1 succeeded, reason: External
check passed, code: 0, check duration: 4ms, status: 1/1 UP.

here's the patch:
--- src/checks.c        2015-12-29 09:53:50.971741769 +0100
+++ checks.c    2015-12-29 09:54:19.211743665 +0100
@@ -272,11 +272,12 @@
                 * cause the server to be marked down.
                 */
                if ((!(check->state & CHK_ST_AGENT) ||
-                   (check->status >= HCHK_STATUS_L57DATA)) &&
-                   (check->health >= check->rise)) {
-                       s->counters.failed_checks++;
-                       report = 1;
-                       check->health--;
+                   (check->status >= HCHK_STATUS_L57DATA))) {
+                       if (check->health >= check->rise) {
+                               s->counters.failed_checks++;
+                               report = 1;
+                               check->health--;
+                       }
                        if (check->health < check->rise)
                                check->health = 0;
                }

Reply via email to