Hi all,
we need help: after upgrading the OS of one of our HAProxy servers to Debian
Buster (which also brought haproxy up to the distribution package, version
1.8.19-1), we started noticing occasional hangs where the process gets stuck
at 100% CPU usage and stops accepting connections of any kind (front-end HTTP
connections, Unix-socket or TCP API queries, and HTTP statistics retrievals).
These hangs happen only from time to time (quite randomly: anywhere from 10
minutes to several hours after haproxy starts), but we have observed that
they are always triggered by a “set maxconn server” command sent via the API
while haproxy is busy processing normal HTTP requests. (To keep some
backends’ capacity up to date, we have a cron job that issues a batch of 30
to 40 of these commands to haproxy’s API once per minute.)
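For reference, each update in the batch is a single runtime-API command line
sent to the admin socket. A minimal sketch of what our cron does (the
backend/server names, maxconn value and socket path below are illustrative,
not our real config):

```shell
#!/bin/sh
# Illustrative sketch only: names, values and socket path are examples.
SOCK=${SOCK:-/var/run/haproxy.sock}

# Build one "set maxconn server" CLI line for a backend/server pair.
maxconn_cmd() {
    printf 'set maxconn server %s/%s %s\n' "$1" "$2" "$3"
}

# The real cron pipes each line to the admin socket, one command per
# connection, e.g.:
#   maxconn_cmd group_b_remote_backend srv1 19 | socat stdio "$SOCK"
maxconn_cmd group_b_remote_backend srv1 19
```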
We have been unable to find any pattern in the backend, server or maxconn
values of the ‘set maxconn server’ commands that caused a hang (they seem to
differ on every occasion), nor have we been able to correlate the hangs with
the haproxy workload at the time.
We have also tried compiling a more recent version of the 1.8 series
(1.8.29) from source on the target machine, but to no avail: the problem
persists.
The only thing we have been able to identify is where in the code the
process gets stuck: we attached gdb to the haproxy process and let it run
freely, feeding it regular workload and the usual API-based maxconn updates,
until it hung. Interrupting execution and asking the debugger for a
backtrace produced this:
^C
Program received signal SIGINT, Interrupt.
0x0000563d70c8613a in process_srv_queue (s=s@entry=0x563d70dd1960) at
src/queue.c:177
177 HA_SPIN_LOCK(SERVER_LOCK, &s->lock);
(gdb) bt full
#0 0x0000563d70c8613a in process_srv_queue (s=s@entry=0x563d70dd1960) at
src/queue.c:177
p = 0x563d70dcf330
maxconn = <optimized out>
remote = 0
#1 0x0000563d70beda0f in server_parse_maxconn_change_request
(maxconn_str=<optimized out>, sv=0x563d70dd1960) at src/server.c:1269
v = <optimized out>
end = <optimized out>
v = <optimized out>
end = <optimized out>
#2 server_parse_maxconn_change_request (sv=0x563d70dd1960,
maxconn_str=<optimized out>) at src/server.c:1247
v = <optimized out>
end = 0x563d70d69603 ""
#3 0x0000563d70bf2db2 in cli_parse_set_maxconn_server (private=<optimized
out>, appctx=0x563d7143ff30, args=0x7ffd15cb9d20) at src/server.c:4295
sv = 0x563d70dd1960
warning = <optimized out>
sv = <optimized out>
warning = <optimized out>
__pl_r = <optimized out>
__pl_l = <optimized out>
ret = <optimized out>
#4 cli_parse_set_maxconn_server (args=0x7ffd15cb9d20,
appctx=0x563d7143ff30, private=<optimized out>) at src/server.c:4281
sv = <optimized out>
warning = <optimized out>
__pl_r = <optimized out>
__pl_l = <optimized out>
ret = <optimized out>
#5 0x0000563d70c4e5df in cli_parse_request
(appctx=appctx@entry=0x563d7143ff30,
line=0x563d70d69603 "") at src/cli.c:479
args = {0x563d70d695c0 "set", 0x563d70d695c4 "maxconn",
0x563d70d695cc "server", 0x563d70d695d3 "group_b_remote_backend",
0x563d70d69601 "19", 0x563d70d69603 "" <repeats 60 times>}
kw = <optimized out>
arg = <optimized out>
i = <optimized out>
j = <optimized out>
#6 0x0000563d70c4ebde in cli_io_handler (appctx=0x563d7143ff30) at
src/cli.c:622
si = 0x563d714a6ba0
req = 0x563d714a6940
res = 0x563d714a6980
bind_conf = <optimized out>
reql = 68
len = <optimized out>
#7 0x0000563d70c8705e in applet_run_active () at src/applet.c:87
curr = <optimized out>
next = <optimized out>
si = 0x563d714a6ba0
applet_cur_queue = {n = 0x563d7143ff30, p = 0x563d7143ff30}
max_processed = <optimized out>
#8 0x0000563d70c31600 in run_poll_loop () at src/haproxy.c:2563
next = <optimized out>
exp = <optimized out>
next = <optimized out>
exp = <optimized out>
#9 run_thread_poll_loop (data=<optimized out>) at src/haproxy.c:2598
ptif = <optimized out>
ptdf = <optimized out>
start_lock = 0
#10 0x0000563d70b8c6f9 in main (argc=<optimized out>, argv=<optimized out>)
at src/haproxy.c:3201
tids = 0x563d710bda60
threads = 0x563d710bda80
i = <optimized out>
old_sig = {__val = {2048, 140724969120513, 140724969120352, 179,
94821886072224, 94821884909799, 94821886091488, 94821885435755,
140724969120255, 140724969120576, 140724969120271, 140724969120608,
206158430256, 140724969120576, 140724969120384, 7368483128895714304}}
blocked_sig = {__val = {18446744067199990583, 18446744073709551615
<repeats 15 times>}}
err = <optimized out>
retry = <optimized out>
limit = {rlim_cur = 16497, rlim_max = 16497}
errmsg =
"\000\000\000\000\000\000\000\000P\000\000\000\000\000\000\000\210\334\377\377\377\377\377\377\000\000\000\000\000\000\000\000\003\000\000\000\060",
'\000' <repeats 19 times>, "[\000\000\000n\000\000\000|\000\000\000w",
'\000' <repeats 11 times>,
"\360\006\236\f\335\177\000\000@\274\235\f\335\177\000\000
\000\000"
pidfd = <optimized out>
... so the spin on HA_SPIN_LOCK(SERVER_LOCK, &s->lock) at line 177 of
queue.c seems to be the culprit: the process appears to be spinning on the
server lock forever, which would also explain the 100% CPU usage.
At this point we don’t know what to do to get rid of this problem. We also
tried the newer 2.x series, and the issue is still there; the only
difference is that the haproxy process doesn’t hang but gets restarted
instead (although I don’t know whether the restart is systemd’s doing).
Any idea about what we could do?
Thank you very much in advance!
Oriol
P.S. I’m not attaching logs or system config to keep this first post simple,
but if anyone wants to take a look I’ll be happy to share them; just ask.