Hi Miles,

On Tue, Nov 05, 2024 at 06:54:08PM +1100, Miles Hampson wrote:
> Hi,
>
> I've encountered a situation where HAProxy does not fail over from a
> server it has marked as DOWN to a backup server it has marked as UP. I
> have managed to reproduce this consistently in a test environment, here
> is the (I hope) relevant configuration:
>
> defaults http_defaults_1
>     mode http
>     hash-type consistent
>     hash-balance-factor 150
>     maxconn 4096
>     http-reuse safe
>     option abortonclose
>     option allbackups
>     timeout check 2s
>     timeout connect 15s
>     timeout client 30s
>     timeout server 30s
>
> backend server-backend from http_defaults_1
>     balance roundrobin
>     # We keep this at the default because retrying anything else
>     # risks duplicating events on these servers
>     retry-on conn-failure
>     server server-006 fd8a...0006:8080 maxconn 32 check inter 250 alpn h2
>     server server-012 fd8a...0012:8080 maxconn 32 check backup alpn h2
>
> This is with HAProxy 3.0.3-95a607c running on a VPS with 16GB RAM (we
> have seen the same issue on a dedicated server with 64GB though), which
> is running Ubuntu 24.04.1 with the default net.ipv4.tcp_retries2 = 15,
> net.ipv4.tcp_syn_retries = 6, and tcp_fin_timeout of 60s (these also
> apply to IPv6 connections). CPU usage is under 20%.
>
> Once I have a small load running (20 req/sec), if I make the 8080 port
> on server-006 temporarily unavailable by restarting the service on it,
> HAProxy logs the transition of server-006 to DOWN (and the stats socket
> and server_check_failure metrics show the same) and server-012 picks up
> requests as expected, with no 5xx errors recorded.
>
> However, if I instead kill the server-006 machine (so that a TCP health
> check to it with `nc` fails with a timeout rather than a connection
> refused), the server is marked as DOWN as before, but all requests
> coming into HAProxy for that backend return a 5xx error to the client
> after 15s (the timeout connect), and server-012 does not receive any
> requests despite showing as UP in the stats socket. This "not failed
> over" state of 100% 5xx errors goes on for minutes, sometimes hours,
> and how long it lasts seems to depend on the load. Reducing the load to
> a few requests a minute avoids the issue (and dropping the load when it
> is in the "not failed over" state also fixes the issue).
>
> I would have expected the <=32 in-flight requests to have been
> redispatched to 012 as soon as 006 was marked DOWN, and the other
> <=4096-32 requests to have been held in the frontend queue until the
> backend ones were finished, but understandably things get more
> complicated when you consider timeouts.
This reminds me of a bug related to queue handling: if there were
already requests queued on a server, subsequent requests would go
straight into that server's queue as well, regardless of the server's
state, and would only be picked up once the server finished processing
a previous request. I would appreciate it if you could recheck with an
up-to-date version, to be sure we're not chasing an already fixed
issue.

Also, given that you have only one server of each type here (one
active and one backup), deterministic algorithms such as "balance
source" should normally not exhibit this behavior. I'm not suggesting
this as a solution, but as a temporary workaround, of course. If
you're building from sources, you can even try the latest 3.0-stable
(about to be released as 3.0.6 soon).

Thanks,
Willy
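
P.S. To make the workaround concrete, here is a rough sketch of the
backend from your mail with only the balance algorithm changed (the
server addresses are truncated as in your message, so treat them as
placeholders):

    backend server-backend from http_defaults_1
        # deterministic: requests from a given client source always map
        # to the same server while it is UP, and fall over to the backup
        # once server-006 is marked DOWN
        balance source
        retry-on conn-failure
        server server-006 fd8a...0006:8080 maxconn 32 check inter 250 alpn h2
        server server-012 fd8a...0012:8080 maxconn 32 check backup alpn h2

Again, this is only a stopgap while we figure out whether the queueing
issue above is what you're hitting, not a proper fix.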