On 3/29/22 at 18:02, Lais, Alexander wrote:
Dear all,

We are using the backend health checks to disable flapping backends.

The default values for rise and fall are 2 consecutive successful and 3 
consecutive failed checks.

Our check interval is at 1000ms (a little frequent, potentially part of the 
problem).

Here is what we observed, using HAProxy 2.4.4:

1. Falling

It started with the backend being up and then going down (fall).

2022-03-23T21:31:54.942Z        Health check for server 
http-routers-http1/node4 failed, reason: Layer4 timeout, check duration: 
1000ms, status: 2/3 UP.
2022-03-23T21:31:56.920Z        Health check for server 
http-routers-http1/node4 failed, reason: Layer4 timeout, check duration: 
1001ms, status: 1/3 UP.
2022-03-23T21:31:57.931Z        Health check for server 
http-routers-http1/node4 succeeded, reason: Layer7 check passed, code: 200, 
check duration: 1ms, status: 3/3 UP.
2022-03-24T10:03:27.223Z        Health check for server http-routers-http1/node4 failed, 
reason: Layer7 wrong status, code: 503, info: "Service Unavailable", check 
duration: 1ms, status: 2/3 UP.
2022-03-24T10:03:28.234Z        Health check for server http-routers-http1/node4 failed, 
reason: Layer7 wrong status, code: 503, info: "Service Unavailable", check 
duration: 1ms, status: 1/3 UP.
2022-03-24T10:03:29.237Z        Health check for server http-routers-http1/node4 failed, 
reason: Layer7 wrong status, code: 503, info: "Service Unavailable", check 
duration: 1ms, status: 0/2 DOWN.

We go down from 3/3 to 2/3, 1/3 and back up again to 3/3. My assumption is that 
it then measured 2/3, but only needs 2 for rising, i.e. 2/2, which is bumped to 
3/3 as the backend is now considered up.

The backend stays up for a while and then goes down following the sequence I 
expected, i.e. 3/3, 2/3, 1/3, 0/3 -> 0/2 (as we need 2 for rise).

2. Rising

2022-03-24T10:12:26.846Z        Health check for server 
http-routers-http1/node4 failed, reason: Layer4 timeout, check duration: 
1000ms, status: 0/2 DOWN.
2022-03-24T10:12:29.843Z        Health check for server http-routers-http1/node4 failed, 
reason: Layer4 connection problem, info: "Connection refused", check duration: 
1ms, status: 0/2 DOWN.
2022-03-24T10:13:43.902Z        Health check for server http-routers-http1/node4 failed, 
reason: Layer7 wrong status, code: 503, info: "Service Unavailable", check 
duration: 2ms, status: 0/2 DOWN.
2022-03-24T10:14:03.039Z        Health check for server 
http-routers-http1/node4 succeeded, reason: Layer7 check passed, code: 200, 
check duration: 1ms, status: 1/2 DOWN.
2022-03-24T10:14:04.079Z        Health check for server 
http-routers-http1/node4 succeeded, reason: Layer7 check passed, code: 200, 
check duration: 1ms, status: 3/3 UP.

So coming up (rise), it goes from 0/2 probes to 1/2 to 3/3. My assumption is 
that it goes to 2/2, is considered up, and is bumped to 3/3 because for fall we 
now need 3 failed probes.


The documentation describes rise / fall as “number of subsequent probes that 
succeeded / failed”.
From my observations it looks like it is a sliding window of the last n being 
successful, i.e. when the number of fall is larger than rise, it is easier to 
rise back up with a single successful probe.

Maybe I’m misreading the log outputs or drawing the wrong conclusions.

If someone knows by heart how it’s supposed to work based on the code that 
would be great. Otherwise we can dig some more ourselves.


Hi,

Rise and fall values are the number of consecutive successful/unsuccessful health checks. When a server is DOWN, we count the number of consecutive successful health checks. If the counter reaches the rise value, the server is considered UP. Otherwise, on each failure, the counter is reset.

The same is done when the server is UP: we count the number of consecutive unsuccessful health checks. If the counter reaches the fall value, the server is considered DOWN. Otherwise, on each success, the counter is reset.

Internally it is a bit more complex but the idea is the same.

In logs, the rise value is reported when the server is DOWN (X/rise) and the counter is incremented on each success (so from 0 to rise-1). The fall value is reported when the server is UP (Y/fall) and the counter is decremented on each failure (from fall to 1). So when the server is set to DOWN state, you will never see "0/3 UP" in logs but "0/2 DOWN" instead. The same is true when the server is set to UP state: "2/2 DOWN" is never reported because "3/3 UP" is reported instead.
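
As a rough illustration of the counting and log reporting described here, the following Python sketch replays a sequence of check results. It is a simplified model under the assumption of rise 2 and fall 3, not HAProxy's actual code (which, as said, is a bit more complex internally). The names ServerHealth and replay are made up for this example.

```python
from dataclasses import dataclass


@dataclass
class ServerHealth:
    """Simplified model of rise/fall counting (hypothetical, not HAProxy code)."""
    rise: int = 2
    fall: int = 3
    up: bool = True
    count: int = 0  # consecutive results that contradict the current state

    def check(self, success: bool) -> str:
        """Record one health-check result and return the status as it would be logged."""
        if success == self.up:
            # The result agrees with the current state: the streak is reset.
            self.count = 0
        else:
            self.count += 1
            threshold = self.fall if self.up else self.rise
            if self.count >= threshold:
                # Enough consecutive contradicting checks: switch state.
                self.up = not self.up
                self.count = 0
        return self.status()

    def status(self) -> str:
        if self.up:
            # While UP, the log counts down from fall on each failure.
            return f"{self.fall - self.count}/{self.fall} UP"
        # While DOWN, the log counts up towards rise on each success.
        return f"{self.count}/{self.rise} DOWN"


def replay(results, **kwargs):
    """Feed a list of results (True = passed, False = failed) to a fresh server."""
    server = ServerHealth(**kwargs)
    return [server.check(r) for r in results]


if __name__ == "__main__":
    # Same order of successes and failures as in the quoted logs.
    sequence = [False, False, True, False, False, False, False, True, True]
    for line in replay(sequence):
        print(line)
```

Run on that sequence, the model gives the same status values as the quoted logs: 2/3 UP, 1/3 UP, 3/3 UP, then 2/3 UP, 1/3 UP, 0/2 DOWN, and finally 0/2 DOWN, 1/2 DOWN, 3/3 UP.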

And you're right, with a rise value lower than the fall value it is quicker to consider a DOWN server as UP than the opposite. But with rise set to 2, two consecutive successful health checks are still needed to set a server UP; a single success is not enough.

--
Christopher Faulet
