Dear all,
We are using the backend health checks to disable flapping backends.
The default values for rise and fall are 2 consecutive successful and 3
consecutive failed checks.
Our check interval is 1000ms (a little frequent, and potentially part of the
problem).
Here is what we observed, using HAProxy 2.4.4:
1. Falling
It started with the backend being up and then going down (fall).
> 2022-03-23T21:31:54.942Z Health check for server
> http-routers-http1/node4 failed, reason: Layer4 timeout, check duration:
> 1000ms, status: 2/3 UP.
> 2022-03-23T21:31:56.920Z Health check for server
> http-routers-http1/node4 failed, reason: Layer4 timeout, check duration:
> 1001ms, status: 1/3 UP.
> 2022-03-23T21:31:57.931Z Health check for server
> http-routers-http1/node4 succeeded, reason: Layer7 check passed, code: 200,
> check duration: 1ms, status: 3/3 UP.
> 2022-03-24T10:03:27.223Z Health check for server
> http-routers-http1/node4 failed, reason: Layer7 wrong status, code: 503,
> info: "Service Unavailable", check duration: 1ms, status: 2/3 UP.
> 2022-03-24T10:03:28.234Z Health check for server
> http-routers-http1/node4 failed, reason: Layer7 wrong status, code: 503,
> info: "Service Unavailable", check duration: 1ms, status: 1/3 UP.
> 2022-03-24T10:03:29.237Z Health check for server
> http-routers-http1/node4 failed, reason: Layer7 wrong status, code: 503,
> info: "Service Unavailable", check duration: 1ms, status: 0/2 DOWN.
The status goes down from 3/3 to 2/3 and 1/3, and then back up to 3/3 after a
single successful check. My assumption is that the success was counted as 2/3,
but since only 2 are needed for rising (i.e. 2/2), the backend is considered up
again and the counter is bumped to 3/3.
The backend then stays up for a while and later goes down the way I expected,
i.e. 3/3, 2/3, 1/3, 0/3 -> 0/2 (shown as 0/2 because we need 2 for rise).
2. Rising
> 2022-03-24T10:12:26.846Z Health check for server
> http-routers-http1/node4 failed, reason: Layer4 timeout, check duration:
> 1000ms, status: 0/2 DOWN.
> 2022-03-24T10:12:29.843Z Health check for server
> http-routers-http1/node4 failed, reason: Layer4 connection problem, info:
> "Connection refused", check duration: 1ms, status: 0/2 DOWN.
> 2022-03-24T10:13:43.902Z Health check for server
> http-routers-http1/node4 failed, reason: Layer7 wrong status, code: 503,
> info: "Service Unavailable", check duration: 2ms, status: 0/2 DOWN.
> 2022-03-24T10:14:03.039Z Health check for server
> http-routers-http1/node4 succeeded, reason: Layer7 check passed, code: 200,
> check duration: 1ms, status: 1/2 DOWN.
> 2022-03-24T10:14:04.079Z Health check for server
> http-routers-http1/node4 succeeded, reason: Layer7 check passed, code: 200,
> check duration: 1ms, status: 3/3 UP.
So when coming up (rise), it goes from 0/2 to 1/2 and then straight to 3/3. My
assumption is that it reaches 2/2, is considered up, and is bumped to 3/3
because for fall we now need 3 failed probes.
The documentation describes rise / fall as the “number of subsequent probes that
succeeded / failed”.
From my observations it looks more like a sliding window over the last n
probes, i.e. when fall is larger than rise, it is easier to rise back up with a
single successful probe.
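To make the behaviour we think we are seeing concrete, here is a minimal Python
sketch of a single counter that reproduces the status values from both log
excerpts above. It is only our guess at a model, not something taken from the
HAProxy source or documentation; the class name HealthCounter and its methods
are made up for illustration.

```python
class HealthCounter:
    """Hypothetical model of one server's health state (not HAProxy code)."""

    def __init__(self, rise=2, fall=3):
        self.rise = rise
        self.fall = fall
        # Assumption: one internal counter from 0 to rise + fall - 1.
        # We start fully up, which is displayed as "3/3 UP" for rise=2, fall=3.
        self.health = rise + fall - 1

    def check(self, success):
        """Apply one probe result and return the status the way the log shows it."""
        if success:
            self.health += 1
            if self.health >= self.rise:
                # Reaching the rise threshold jumps straight to the maximum.
                self.health = self.rise + self.fall - 1
        elif self.health > 0:
            self.health -= 1
            if self.health < self.rise:
                # Dropping below the rise threshold resets the counter to zero.
                self.health = 0
        return self.status()

    def status(self):
        if self.health >= self.rise:
            # While up, show the remaining failures out of "fall".
            return f"{self.health - self.rise + 1}/{self.fall} UP"
        # While down, show the successes so far out of "rise".
        return f"{self.health}/{self.rise} DOWN"


# Replay the probe results from the "Falling" and "Rising" logs in order.
server = HealthCounter(rise=2, fall=3)
falling = [False, False, True, False, False, False]
rising = [False, False, False, True, True]
for ok in falling + rising:
    print(server.check(ok))
```

In this sketch a success that reaches the rise threshold jumps straight to the
top of the range, which is what the 1/3 -> 3/3 and 1/2 -> 3/3 transitions in our
logs look like.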
Maybe I’m misreading the log outputs or drawing the wrong conclusions.
If someone knows by heart how it’s supposed to work based on the code, that
would be great. Otherwise we can dig some more ourselves.
Thanks and kind regards,
Alex