Amaury,
Your patch seems to have killed the bug: our newly recompiled haproxy has
been running for more than 24h now without glitches (while the unpatched
versions never got past the 4-hour mark).
I'll let it run for 24 more hours at full load just in case, but things
are looking good and I don't expect to have to come back here crying.
So thank you very much for your help!

Oriol


Message from Amaury Denoyelle <[email protected]> on Mon, 17 May
2021 at 10:32:

> Oriol Magrane <[email protected]> wrote:
>
> > Hi all,
> > we need help: after upgrading the OS of an haproxy server of ours to
> > Debian buster, and with it upgrading haproxy to the OS' packaged
> > version 1.8.19-1, we started noticing occasional hangs of the process:
> > it gets stuck at 100% CPU usage and stops accepting connections of any
> > kind (front-end HTTP connections, unix-socket or TCP API queries, or
> > HTTP statistics retrievals).
> > These hangs happen only from time to time (quite randomly; it may take
> > 10 minutes or several hours after starting haproxy), but we have
> > observed they are always triggered by a "set maxconn server" command
> > sent via the API while haproxy is busy processing normal HTTP
> > requests. (In order to keep some backends' capacity up to date, we
> > have a cron job that issues a batch of 30 to 40 of these commands to
> > haproxy's API once per minute.)
> > We have been unable to establish any relation between the backend,
> > server and maxconn values of the 'set maxconn server' command that
> > caused a hang (they seem to differ on every occasion), and we have
> > been unable to relate it to the haproxy workload at that given time
> > either.
> > We've also tried source-compiling a more recent version of the 1.8
> > series (1.8.29) on the target machine, but to no avail: the problem
> > persists.
> > The only thing we have been able to identify is where in the code the
> > process gets stuck: we attached gdb to the haproxy process and let it
> > run freely, feeding it regular workload and the usual API-based
> > maxconn server updates, until it hung. Interrupting execution and
> > asking the debugger for a backtrace resulted in this:
> > ^C
> > Program received signal SIGINT, Interrupt.
> > 0x0000563d70c8613a in process_srv_queue (s=s@entry=0x563d70dd1960)
> >     at src/queue.c:177
> > 177         HA_SPIN_LOCK(SERVER_LOCK, &s->lock);
> > (gdb) bt full
> > #0  0x0000563d70c8613a in process_srv_queue (s=s@entry=0x563d70dd1960)
> >     at src/queue.c:177
> >         p = 0x563d70dcf330
> >         maxconn = <optimized out>
> >         remote = 0
> > #1  0x0000563d70beda0f in server_parse_maxconn_change_request
> >     (maxconn_str=<optimized out>, sv=0x563d70dd1960) at src/server.c:1269
> >         v = <optimized out>
> >         end = <optimized out>
> >         v = <optimized out>
> >         end = <optimized out>
> > #2  server_parse_maxconn_change_request (sv=0x563d70dd1960,
> >     maxconn_str=<optimized out>) at src/server.c:1247
> >         v = <optimized out>
> >         end = 0x563d70d69603 ""
> > #3  0x0000563d70bf2db2 in cli_parse_set_maxconn_server
> >     (private=<optimized out>, appctx=0x563d7143ff30,
> >     args=0x7ffd15cb9d20) at src/server.c:4295
> >         sv = 0x563d70dd1960
> >         warning = <optimized out>
> >         sv = <optimized out>
> >         warning = <optimized out>
> >         __pl_r = <optimized out>
> >         __pl_l = <optimized out>
> >         ret = <optimized out>
> > #4  cli_parse_set_maxconn_server (args=0x7ffd15cb9d20,
> >     appctx=0x563d7143ff30, private=<optimized out>) at src/server.c:4281
> >         sv = <optimized out>
> >         warning = <optimized out>
> >         __pl_r = <optimized out>
> >         __pl_l = <optimized out>
> >         ret = <optimized out>
> > #5  0x0000563d70c4e5df in cli_parse_request
> >     (appctx=appctx@entry=0x563d7143ff30, line=0x563d70d69603 "")
> >     at src/cli.c:479
> >         args = {0x563d70d695c0 "set", 0x563d70d695c4 "maxconn",
> >           0x563d70d695cc "server", 0x563d70d695d3 "group_b_remote_backend",
> >           0x563d70d69601 "19", 0x563d70d69603 "" <repeats 60 times>}
> >         kw = <optimized out>
> >         arg = <optimized out>
> >         i = <optimized out>
> >         j = <optimized out>
> > #6  0x0000563d70c4ebde in cli_io_handler (appctx=0x563d7143ff30)
> >     at src/cli.c:622
> >         si = 0x563d714a6ba0
> >         req = 0x563d714a6940
> >         res = 0x563d714a6980
> >         bind_conf = <optimized out>
> >         reql = 68
> >         len = <optimized out>
> > #7  0x0000563d70c8705e in applet_run_active () at src/applet.c:87
> >         curr = <optimized out>
> >         next = <optimized out>
> >         si = 0x563d714a6ba0
> >         applet_cur_queue = {n = 0x563d7143ff30, p = 0x563d7143ff30}
> >         max_processed = <optimized out>
> > #8  0x0000563d70c31600 in run_poll_loop () at src/haproxy.c:2563
> >         next = <optimized out>
> >         exp = <optimized out>
> >         next = <optimized out>
> >         exp = <optimized out>
> > #9  run_thread_poll_loop (data=<optimized out>) at src/haproxy.c:2598
> >         ptif = <optimized out>
> >         ptdf = <optimized out>
> >         start_lock = 0
> > #10 0x0000563d70b8c6f9 in main (argc=<optimized out>,
> >     argv=<optimized out>) at src/haproxy.c:3201
> >         tids = 0x563d710bda60
> >         threads = 0x563d710bda80
> >         i = <optimized out>
> >         old_sig = {__val = {2048, 140724969120513, 140724969120352, 179,
> >           94821886072224, 94821884909799, 94821886091488, 94821885435755,
> >           140724969120255, 140724969120576, 140724969120271,
> >           140724969120608, 206158430256, 140724969120576, 140724969120384,
> >           7368483128895714304}}
> >         blocked_sig = {__val = {18446744067199990583,
> >           18446744073709551615 <repeats 15 times>}}
> >         err = <optimized out>
> >         retry = <optimized out>
> >         limit = {rlim_cur = 16497, rlim_max = 16497}
> >         errmsg =
> >           "\000\000\000\000\000\000\000\000P\000\000\000\000\000\000\000\210\334\377\377\377\377\377\377\000\000\000\000\000\000\000\000\003\000\000\000\060",
> >           '\000' <repeats 19 times>, "[\000\000\000n\000\000\000|\000\000\000w",
> >           '\000' <repeats 11 times>,
> >           "\360\006\236\f\335\177\000\000@\274\235\f\335\177\000\000 \000\000"
> >         pidfd = <optimized out>
> > ... so the server HA_SPIN_LOCK() at line 177 of queue.c seems to be
> > the culprit.
> > At this point, we don't know what we can do to get rid of this
> > problem. We tried with the new haproxy 2.x series and the issue is
> > still there; the only difference is that the process doesn't hang but
> > gets restarted instead (although I don't know if the restart is
> > systemd's doing).
> > Any idea about what we could do?
> > Thank you very much in advance!
> > Oriol
> > P.S. I don't attach logs or system config to keep this first post
> > simple, but if anyone wants to give them a look I'll be happy to
> > share; just ask, please.
>
> Thanks for your report.
>
> A deadlock may indeed arise when using 'set maxconn server' if the
> function process_srv_queue is called, which is not always the case. I
> had a quick look and the issue seems to be present in the latest
> version as well. I will continue to inspect the problem in detail.
>
> As for the restart in 2.x, it is probably due to the internal haproxy
> watchdog, which was added in version 2.0 and is specifically designed
> to detect this type of issue.
>
> --
> Amaury Denoyelle
>
