Oriol Magrane <[email protected]> wrote:

> Hi all,
> we need help: after upgrading the OS of one of our haproxy servers to
> Debian buster, and with haproxy upgraded to the OS' packaged version
> 1.8.19-1, we started noticing occasional hangs where the process gets
> stuck at 100% CPU usage and stops accepting connections of any kind
> (front-end HTTP connections, unix socket or TCP API queries, or HTTP
> statistics retrievals).
> These hangs happen only from time to time (quite randomly; it may take
> 10 minutes or several hours after starting haproxy), but we have
> observed they are always triggered by a "set maxconn server" command
> sent via the API while haproxy is busy processing normal HTTP requests.
> (In order to keep some backends' capacity up to date, we have a cron
> job that issues a batch of 30 to 40 of these commands to haproxy's API
> once per minute.)
> We have been unable to establish any relation between the backend,
> server and maxconn values of the "set maxconn server" command that
> caused a hang (they seem to differ on every occasion), nor have we
> been able to relate it to the haproxy workload at that moment.
> We have also tried compiling a more recent version of the 1.8 series
> (1.8.29) from source on the target machine, but to no avail: the
> problem persists.
> The only thing we seem to have been able to identify is where in the
> code the process gets stuck: we attached gdb to the haproxy process
> and let it run freely, feeding it regular workload and the usual
> API-based maxconn updates, until it hung. Interrupting the execution
> and asking the debugger for a backtrace resulted in this:
> ^C
> Program received signal SIGINT, Interrupt.
> 0x0000563d70c8613a in process_srv_queue (s=s@entry=0x563d70dd1960) at
> src/queue.c:177
> 177 HA_SPIN_LOCK(SERVER_LOCK, &s->lock);
> (gdb) bt full
> #0  0x0000563d70c8613a in process_srv_queue (s=s@entry=0x563d70dd1960) at
> src/queue.c:177
>         p = 0x563d70dcf330
>         maxconn = <optimized out>
>         remote = 0
> #1  0x0000563d70beda0f in server_parse_maxconn_change_request
> (maxconn_str=<optimized out>, sv=0x563d70dd1960) at src/server.c:1269
>         v = <optimized out>
>         end = <optimized out>
>         v = <optimized out>
>         end = <optimized out>
> #2  server_parse_maxconn_change_request (sv=0x563d70dd1960,
> maxconn_str=<optimized out>) at src/server.c:1247
>         v = <optimized out>
>         end = 0x563d70d69603 ""
> #3  0x0000563d70bf2db2 in cli_parse_set_maxconn_server (private=<optimized
> out>, appctx=0x563d7143ff30, args=0x7ffd15cb9d20) at src/server.c:4295
>         sv = 0x563d70dd1960
>         warning = <optimized out>
>         sv = <optimized out>
>         warning = <optimized out>
>         __pl_r = <optimized out>
>         __pl_l = <optimized out>
>         ret = <optimized out>
> #4  cli_parse_set_maxconn_server (args=0x7ffd15cb9d20,
> appctx=0x563d7143ff30, private=<optimized out>) at src/server.c:4281
>         sv = <optimized out>
>         warning = <optimized out>
>         __pl_r = <optimized out>
>         __pl_l = <optimized out>
>         ret = <optimized out>
> #5  0x0000563d70c4e5df in cli_parse_request
> (appctx=appctx@entry=0x563d7143ff30,
> line=0x563d70d69603 "") at src/cli.c:479
>         args = {0x563d70d695c0 "set", 0x563d70d695c4 "maxconn",
> 0x563d70d695cc "server", 0x563d70d695d3 "group_b_remote_backend",
> 0x563d70d69601 "19", 0x563d70d69603 "" <repeats 60 times>}
>         kw = <optimized out>
>         arg = <optimized out>
>         i = <optimized out>
>         j = <optimized out>
> #6  0x0000563d70c4ebde in cli_io_handler (appctx=0x563d7143ff30) at
> src/cli.c:622
>         si = 0x563d714a6ba0
>         req = 0x563d714a6940
>         res = 0x563d714a6980
>         bind_conf = <optimized out>
>         reql = 68
>         len = <optimized out>
> #7  0x0000563d70c8705e in applet_run_active () at src/applet.c:87
>         curr = <optimized out>
>         next = <optimized out>
>         si = 0x563d714a6ba0
>         applet_cur_queue = {n = 0x563d7143ff30, p = 0x563d7143ff30}
>         max_processed = <optimized out>
> #8  0x0000563d70c31600 in run_poll_loop () at src/haproxy.c:2563
>         next = <optimized out>
>         exp = <optimized out>
>         next = <optimized out>
>         exp = <optimized out>
> #9  run_thread_poll_loop (data=<optimized out>) at src/haproxy.c:2598
>         ptif = <optimized out>
>         ptdf = <optimized out>
>         start_lock = 0
> #10 0x0000563d70b8c6f9 in main (argc=<optimized out>, argv=<optimized out>)
> at src/haproxy.c:3201
>         tids = 0x563d710bda60
>         threads = 0x563d710bda80
>         i = <optimized out>
>         old_sig = {__val = {2048, 140724969120513, 140724969120352, 179,
> 94821886072224, 94821884909799, 94821886091488, 94821885435755,
> 140724969120255, 140724969120576, 140724969120271, 140724969120608,
> 206158430256, 140724969120576, 140724969120384, 7368483128895714304}}
>         blocked_sig = {__val = {18446744067199990583, 18446744073709551615
> <repeats 15 times>}}
>         err = <optimized out>
>         retry = <optimized out>
>         limit = {rlim_cur = 16497, rlim_max = 16497}
>         errmsg =
> "\000\000\000\000\000\000\000\000P\000\000\000\000\000\000\000\210\334\377\377\377\377\377\377\000\000\000\000\000\000\000\000\003\000\000\000\060",
> '\000' <repeats 19 times>, "[\000\000\000n\000\000\000|\000\000\000w",
> '\000' <repeats 11 times>,
> "\360\006\236\f\335\177\000\000@\274\235\f\335\177\000\000
> \000\000"
>         pidfd = <optimized out>
> ... so the HA_SPIN_LOCK() on the server lock at line 177 of queue.c
> seems to be the culprit.
> At this point, we don't know what we can do to get rid of this
> problem. We tried the new 2.x haproxy series and the issue is still
> there; the only difference is that the haproxy process doesn't hang
> but gets restarted instead (although I don't know whether the restart
> is systemd's doing).
> Any idea about what we could do?
> Thank you very much in advance!
> Oriol
> P.S. I don't attach logs or the system config to keep this first post
> simple, but if anyone wants to take a look I'll be happy to share
> them; just ask, please

Thanks for your report.

A deadlock may indeed arise when using 'set maxconn server' if the
function process_srv_queue is called, which is not always the case. I
had a quick look and the issue seems to be present in the latest version
as well. I will continue to inspect the problem in detail.

As for the restart on 2.x, it is probably due to the internal haproxy
watchdog, which was added in version 2.0 and is specifically designed
to detect this type of issue.

-- 
Amaury Denoyelle
