Dear all,

We are using the backend health checks to disable flapping backends.
The default values for rise and fall are 2 consecutive successful checks and 3 consecutive failed checks. Our check interval is 1000ms (a little frequent, and potentially part of the problem).

Here is what we observed, using HAProxy 2.4.4:

1. Falling

It started with the backend being up and then going down (fall).

> 2022-03-23T21:31:54.942Z Health check for server http-routers-http1/node4 failed, reason: Layer4 timeout, check duration: 1000ms, status: 2/3 UP.
> 2022-03-23T21:31:56.920Z Health check for server http-routers-http1/node4 failed, reason: Layer4 timeout, check duration: 1001ms, status: 1/3 UP.
> 2022-03-23T21:31:57.931Z Health check for server http-routers-http1/node4 succeeded, reason: Layer7 check passed, code: 200, check duration: 1ms, status: 3/3 UP.
> 2022-03-24T10:03:27.223Z Health check for server http-routers-http1/node4 failed, reason: Layer7 wrong status, code: 503, info: "Service Unavailable", check duration: 1ms, status: 2/3 UP.
> 2022-03-24T10:03:28.234Z Health check for server http-routers-http1/node4 failed, reason: Layer7 wrong status, code: 503, info: "Service Unavailable", check duration: 1ms, status: 1/3 UP.
> 2022-03-24T10:03:29.237Z Health check for server http-routers-http1/node4 failed, reason: Layer7 wrong status, code: 503, info: "Service Unavailable", check duration: 1ms, status: 0/2 DOWN.

We go down from 3/3 to 2/3 and 1/3, and then back up again to 3/3. My assumption is that it then measured 2/3, but only needs 2 for rising, i.e. 2/2, which is bumped to 3/3 as the backend is now considered up.

The backend stays up for a while and then goes down with the health checks I expected, i.e. 3/3, 2/3, 1/3, 0/3 -> 0/2 (as we need 2 for rise).

2. Rising

> 2022-03-24T10:12:26.846Z Health check for server http-routers-http1/node4 failed, reason: Layer4 timeout, check duration: 1000ms, status: 0/2 DOWN.
> 2022-03-24T10:12:29.843Z Health check for server http-routers-http1/node4 failed, reason: Layer4 connection problem, info: "Connection refused", check duration: 1ms, status: 0/2 DOWN.
> 2022-03-24T10:13:43.902Z Health check for server http-routers-http1/node4 failed, reason: Layer7 wrong status, code: 503, info: "Service Unavailable", check duration: 2ms, status: 0/2 DOWN.
> 2022-03-24T10:14:03.039Z Health check for server http-routers-http1/node4 succeeded, reason: Layer7 check passed, code: 200, check duration: 1ms, status: 1/2 DOWN.
> 2022-03-24T10:14:04.079Z Health check for server http-routers-http1/node4 succeeded, reason: Layer7 check passed, code: 200, check duration: 1ms, status: 3/3 UP.

So when coming up (rise), it goes from 0/2 to 1/2 to 3/3. My assumption is that it reaches 2/2, is considered up, and is then bumped to 3/3 because fall now requires 3 failed probes.

The documentation describes rise / fall as “number of subsequent probes that succeeded / failed”. From my observations it looks more like a sliding window of the last n probes being successful, i.e. when fall is larger than rise, it is easier to get back up with a single successful probe.

Maybe I’m misreading the log output or drawing the wrong conclusions. If someone knows by heart how it is supposed to work based on the code, that would be great. Otherwise we can dig some more ourselves.

Thanks and kind regards,
Alex
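
P.S. To make my assumption concrete, here is a minimal Python sketch of the counter as I read it from the logs. It is not taken from the HAProxy source: the names `HealthCounter` and `replay` are hypothetical, and the reset to 0 on a failed probe while DOWN is a guess that the logs neither confirm nor refute.

```python
class HealthCounter:
    """Hypothetical model of the rise/fall counter, inferred only from the logs."""

    def __init__(self, rise=2, fall=3):
        self.rise = rise
        self.fall = fall
        self.up = True
        # While UP: failures still tolerated. While DOWN: successes collected.
        self.count = fall

    def probe(self, ok):
        """Apply one health check result and return the resulting status string."""
        if self.up:
            if ok:
                # Observed in the logs: one success while UP jumps back to fall/fall.
                self.count = self.fall
            else:
                self.count -= 1
                if self.count == 0:
                    self.up = False  # shown as 0/rise DOWN
        else:
            if ok:
                self.count += 1
                if self.count == self.rise:
                    # Observed in the logs: reaching rise flips to UP at fall/fall.
                    self.up = True
                    self.count = self.fall
            else:
                # Assumption: a failure while DOWN discards collected successes.
                self.count = 0
        return self.status()

    def status(self):
        limit = self.fall if self.up else self.rise
        state = "UP" if self.up else "DOWN"
        return f"{self.count}/{limit} {state}"


def replay(results, rise=2, fall=3):
    """Feed a sequence of probe results (True = succeeded) through the model."""
    counter = HealthCounter(rise, fall)
    return [counter.probe(ok) for ok in results]


# The eleven probe results from the log excerpts, in order (F = failed, S = succeeded).
LOGGED = [c == "S" for c in "FFSFFFFFFSS"]

if __name__ == "__main__":
    for line in replay(LOGGED):
        print(line)
```

Replaying the eleven logged probe results through this model gives the same status strings as in the log excerpts, so the model is at least consistent with what we observed.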