Dear all,

We are using the backend health checks to disable flapping backends.

The default values for rise and fall are 2 consecutive successful and 3 
consecutive failed checks, respectively.

Our check interval is 1000ms (rather frequent, and potentially part of the 
problem).

Here is what we observed, using HAProxy 2.4.4:

1. Falling

It started with the backend being up and then going down (fall).

> 2022-03-23T21:31:54.942Z      Health check for server 
> http-routers-http1/node4 failed, reason: Layer4 timeout, check duration: 
> 1000ms, status: 2/3 UP.
> 2022-03-23T21:31:56.920Z      Health check for server 
> http-routers-http1/node4 failed, reason: Layer4 timeout, check duration: 
> 1001ms, status: 1/3 UP.
> 2022-03-23T21:31:57.931Z      Health check for server 
> http-routers-http1/node4 succeeded, reason: Layer7 check passed, code: 200, 
> check duration: 1ms, status: 3/3 UP.
> 2022-03-24T10:03:27.223Z      Health check for server 
> http-routers-http1/node4 failed, reason: Layer7 wrong status, code: 503, 
> info: "Service Unavailable", check duration: 1ms, status: 2/3 UP.
> 2022-03-24T10:03:28.234Z      Health check for server 
> http-routers-http1/node4 failed, reason: Layer7 wrong status, code: 503, 
> info: "Service Unavailable", check duration: 1ms, status: 1/3 UP.
> 2022-03-24T10:03:29.237Z      Health check for server 
> http-routers-http1/node4 failed, reason: Layer7 wrong status, code: 503, 
> info: "Service Unavailable", check duration: 1ms, status: 0/2 DOWN.

We go from 3/3 down to 2/3 and 1/3, and then straight back up to 3/3 after a 
single successful check. My assumption is that the successful check counted as 
2/3, but since only 2 are needed for rising, it was treated as 2/2 and bumped 
to 3/3 because the backend is considered up.

The backend then stays up for a while and later goes down in the sequence I 
would expect: 3/3, 2/3, 1/3, 0/3, with the last one shown as 0/2 (as we need 2 
for rise).

2. Rising

> 2022-03-24T10:12:26.846Z      Health check for server 
> http-routers-http1/node4 failed, reason: Layer4 timeout, check duration: 
> 1000ms, status: 0/2 DOWN.
> 2022-03-24T10:12:29.843Z      Health check for server 
> http-routers-http1/node4 failed, reason: Layer4 connection problem, info: 
> "Connection refused", check duration: 1ms, status: 0/2 DOWN.
> 2022-03-24T10:13:43.902Z      Health check for server 
> http-routers-http1/node4 failed, reason: Layer7 wrong status, code: 503, 
> info: "Service Unavailable", check duration: 2ms, status: 0/2 DOWN.
> 2022-03-24T10:14:03.039Z      Health check for server 
> http-routers-http1/node4 succeeded, reason: Layer7 check passed, code: 200, 
> check duration: 1ms, status: 1/2 DOWN.
> 2022-03-24T10:14:04.079Z      Health check for server 
> http-routers-http1/node4 succeeded, reason: Layer7 check passed, code: 200, 
> check duration: 1ms, status: 3/3 UP.

So coming up (rise), it goes from 0/2 to 1/2 to 3/3. My assumption is that it 
reaches 2/2, is considered up, and is bumped to 3/3 because we now need 3 
failed probes for fall.
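
To make my assumption concrete, here is a minimal Python sketch of the counter 
behaviour I have in mind. It is inferred purely from the log lines above and is 
not taken from the HAProxy source; the class HealthCounter and its method names 
are made up for illustration.

```python
class HealthCounter:
    """Hypothetical rise/fall counter, inferred from the log lines only."""

    def __init__(self, rise=2, fall=3, up=True):
        self.rise = rise
        self.fall = fall
        # Assumed internal range: 0 .. rise + fall - 1; UP means health >= rise.
        self.top = rise + fall - 1
        self.health = self.top if up else 0

    def is_up(self):
        return self.health >= self.rise

    def status(self):
        # Mimic the "status: n/m UP|DOWN" part of the log lines.
        if self.is_up():
            return f"{self.health - self.rise + 1}/{self.fall} UP"
        return f"{self.health}/{self.rise} DOWN"

    def record(self, success):
        if success:
            self.health = min(self.health + 1, self.top)
            if self.is_up():
                # Assumption: any success at or above rise jumps to the top.
                self.health = self.top
        elif self.health > 0:
            self.health -= 1
            if not self.is_up():
                # Assumption: dropping below rise resets the counter to 0.
                self.health = 0
        return self.status()


if __name__ == "__main__":
    falling = HealthCounter(up=True)
    for ok in (False, False, True, False, False, False):
        print("falling:", falling.record(ok))
    rising = HealthCounter(up=False)
    for ok in (False, False, False, True, True):
        print("rising: ", rising.record(ok))
```

With rise 2 and fall 3 this gives 2/3 UP, 1/3 UP, 3/3 UP, 2/3 UP, 1/3 UP, 
0/2 DOWN for the first sequence, and 0/2 DOWN three times followed by 1/2 DOWN 
and 3/3 UP for the second, which matches both log excerpts. Whether HAProxy 
really implements it this way is what I would like to have confirmed.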


The documentation describes rise / fall as “number of subsequent probes that 
succeeded / failed”.
From my observations it looks more like a sliding window over the last n 
probes, i.e. when fall is larger than rise, it is easier to get back to fully 
up: a single successful probe was enough in the first log.

Maybe I’m misreading the log outputs or drawing the wrong conclusions.

If someone knows by heart how it’s supposed to work based on the code, that 
would be great. Otherwise we can dig some more ourselves.

Thanks and kind regards,
Alex
        
