Hi,
I was able to reproduce the health check bug and create a small patch.
Before patch:
- Health check for server servers/server1 succeeded, reason: External
check passed, code: 0, check duration: 6ms, status: 1/3 DOWN.
- Health check for server servers/server1 succeeded, reason: External
check passed, code: 0, check duration: 4ms, status: 2/3 DOWN.
- Health check for server servers/server1 failed, reason: External check
error, code: 1, check duration: 4ms, status: 2/3 DOWN.
- Health check for server servers/server1 succeeded, reason: External
check passed, code: 0, check duration: 3ms, status: 1/1 UP.
After patch:
- Health check for server servers/server1 succeeded, reason: External
check passed, code: 0, check duration: 3ms, status: 1/3 DOWN.
- Health check for server servers/server1 succeeded, reason: External
check passed, code: 0, check duration: 4ms, status: 2/3 DOWN.
- Health check for server servers/server1 failed, reason: External check
error, code: 1, check duration: 4ms, status: 0/3 DOWN.
- Health check for server servers/server1 succeeded, reason: External
check passed, code: 0, check duration: 4ms, status: 1/3 DOWN.
- Health check for server servers/server1 succeeded, reason: External
check passed, code: 0, check duration: 5ms, status: 2/3 DOWN.
- Health check for server servers/server1 succeeded, reason: External
check passed, code: 0, check duration: 4ms, status: 1/1 UP.
Here's the patch:
--- src/checks.c 2015-12-29 09:53:50.971741769 +0100
+++ checks.c 2015-12-29 09:54:19.211743665 +0100
@@ -272,11 +272,12 @@
 	 * cause the server to be marked down.
 	 */
 	if ((!(check->state & CHK_ST_AGENT) ||
-	    (check->status >= HCHK_STATUS_L57DATA)) &&
-	    (check->health >= check->rise)) {
-		s->counters.failed_checks++;
-		report = 1;
-		check->health--;
+	    (check->status >= HCHK_STATUS_L57DATA))) {
+		if (check->health >= check->rise) {
+			s->counters.failed_checks++;
+			report = 1;
+			check->health--;
+		}
 		if (check->health < check->rise)
 			check->health = 0;
 	}
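
To illustrate the difference, here is a minimal standalone sketch. It is not HAProxy code: the struct and all function names are made up for illustration, it only models a server that is DOWN and trying to come up, and it ignores agent checks and the fall threshold.

```c
/* Hypothetical, simplified model of the health counter of a DOWN server.
 * The server is considered UP once health reaches rise. */
struct sim_check {
	int health;
	int rise;
};

/* Behaviour before the patch: a failed check is only acted upon when
 * health >= rise, so a DOWN server keeps the successes it has already
 * accumulated. */
void fail_before_patch(struct sim_check *c)
{
	if (c->health >= c->rise) {
		c->health--;
		if (c->health < c->rise)
			c->health = 0;
	}
}

/* Behaviour after the patch: any failed check seen while health is below
 * rise resets the counter to 0. */
void fail_after_patch(struct sim_check *c)
{
	if (c->health >= c->rise)
		c->health--;
	if (c->health < c->rise)
		c->health = 0;
}

/* A successful check moves a DOWN server one step closer to UP. */
void succeed(struct sim_check *c)
{
	if (c->health < c->rise)
		c->health++;
}

/* Replay a sequence of results ('O' = check passed, anything else = check
 * failed), starting from health 0. Returns the 1-based index of the check
 * that brings the server UP, or -1 if it never comes UP. */
int checks_until_up(const char *seq, int rise,
                    void (*on_fail)(struct sim_check *))
{
	struct sim_check c = { 0, rise };
	int i;

	for (i = 0; seq[i] != '\0'; i++) {
		if (seq[i] == 'O')
			succeed(&c);
		else
			on_fail(&c);
		if (c.health >= c.rise)
			return i + 1;
	}
	return -1;
}
```

With rise 3, checks_until_up("OOKOOO", 3, fail_before_patch) returns 4 and checks_until_up("OOKOOO", 3, fail_after_patch) returns 6, which matches the two log excerpts above.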
--
Pierre Zemb
Hi,
I've not forgotten you, I'm just running out of time.
Baptiste
On Tue, Sep 29, 2015 at 5:43 PM, <[email protected]> wrote:
On 2015-08-28 16:40, Baptiste wrote:
On 28 Aug 2015 15:45, <[email protected]> wrote:
>
> Hello,
>
> We have tcp-check configured on some backends, which works fine,
> except when the service is flapping.
>
> If the backend server is in a transitional state, for example
> transitionally DOWN (going up), the counter is not reset to 0 if
> tcp-check returns a KO state between OK states. As a result, if
> the service is flapping, the backend comes up for a few seconds quite
> often, even though the OK states are not consecutive.
>
> Example of sequence with rise 3:
>
> KO -> 0/3
> KO -> 0/3
> OK -> 1/3
> KO -> 1/3 <- should go back to 0/3
> KO -> 1/3
> KO -> 1/3
> OK -> 2/3
> KO -> 2/3
> KO -> 2/3
> OK -> 3/3 -> Server UP
>
> Is there a way to configure the counter to reset itself in case of
> flapping?
>
> Thanks.
Hi there,
Thanks for reporting this behavior.
I'll have a look and come back to you.
Baptiste
Hello,
Are you able to reproduce this on your side?
Thanks