Hi Cyril,

On Tue, Aug 21, 2018 at 04:13:38PM +0200, Cyril Bonté wrote:
> > From: "Cyril Bonté" <cyril.bo...@free.fr>
> > To: "Willy Tarreau" <w...@1wt.eu>
> > Cc: "HAProxy" <haproxy@formilux.org>
> > Sent: Tuesday, August 21, 2018 16:09:55
> > Subject: haproxy-1.9-dev [0c026f49e]: 100% CPU when a server goes DOWN
> > with option redispatch
> >
> > Hi Willy,
> >
> > Here is another issue seen today with the current dev branch [tests
> > were also made after pulling recent commit 3bcc2699b].
> >
> > Since 0c026f49e, when a server status is set to DOWN and option
> > redispatch is enabled, the haproxy process hits 100% CPU.
> > Even more, with the latest commits, if haproxy is compiled with
> > DEBUG_FULL, it will simply segfault.
>
> #0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
> #1  0x00007ffff703f2f1 in __GI_abort () at abort.c:79
> #2  0x000055555558040f in __spin_lock (line=<optimized out>,
>     file=<optimized out>, func=<optimized out>, l=<optimized out>,
>     lbl=<optimized out>) at include/common/hathreads.h:725
> #3  pendconn_redistribute (s=0x555555786d80) at src/queue.c:411
> #4  0x00005555555e7842 in srv_update_status () at src/server.c:4680
> #5  0x00005555555e89a1 in srv_set_stopped.part () at src/server.c:966
> #6  0x00005555555e8bd1 in srv_set_stopped (s=<optimized out>,
>     reason=<optimized out>, check=<optimized out>) at src/server.c:948
> #7  0x0000555555639358 in process_chk_conn (state=<optimized out>,
>     context=0x5555557871d0, t=0x555555783290) at src/checks.c:2265
> #8  process_chk () at src/checks.c:2304
> #9  0x00005555556a2293 in process_runnable_tasks () at src/task.c:384
> #10 0x00005555556408b9 in run_poll_loop () at src/haproxy.c:2386
> #11 run_thread_poll_loop () at src/haproxy.c:2451
> #12 0x0000555555581e4b in main () at src/haproxy.c:3053
> #13 0x00007ffff702ab17 in __libc_start_main (main=0x555555580e50 <main>,
>     argc=3, argv=0x7fffffffdfa8, init=<optimized out>,
>     fini=<optimized out>, rtld_fini=<optimized out>,
>     stack_end=0x7fffffffdf98) at ../csu/libc-start.c:310
> #14 0x0000555555583d0a in _start ()
Thank you for this trace. I'm currently debugging a very long chain of insufficient locking in the server check code. Most operations from the CLI simply take no lock, and some status update functions assume they are locked while they can in fact be called from various places. This remained partially hidden by the asynchronous mode called from the rendez-vous point, but now that they are called synchronously, everything breaks.

I have good hope of sorting this out by starting to place comments at the top of the functions to detail their expectations and to detect all the wrong assumptions. If I can't sort it out like this, I can easily revert the patch that made everything synchronous and go back to the asynchronous mode, hiding all the dust under the carpet. But as you know I really hate doing this, so I prefer to face it head-on for now.

Cheers,
Willy