Hi Cyril,

On Tue, Aug 21, 2018 at 04:13:38PM +0200, Cyril Bonté wrote:
> > De: "Cyril Bonté" <cyril.bo...@free.fr>
> > À: "Willy Tarreau" <w...@1wt.eu>
> > Cc: "HAProxy" <haproxy@formilux.org>
> > Envoyé: Mardi 21 Août 2018 16:09:55
> > Objet: haproxy-1.9-dev [0c026f49e]: 100% CPU when a server goes DOWN with 
> > option redispatch
> > 
> > Hi Willy,
> > Here is another issue seen today with the current dev branch [tests
> > were also made after pulling recent commit 3bcc2699b].
> > 
> > Since 0c026f49e, when a server's status is set to DOWN and option
> > redispatch is enabled, the haproxy process hits 100% CPU.
> > Worse, with the latest commits, if haproxy is compiled with
> > DEBUG_FULL, it simply segfaults.
> 
> #0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
> #1  0x00007ffff703f2f1 in __GI_abort () at abort.c:79
> #2  0x000055555558040f in __spin_lock (line=<optimized out>, file=<optimized 
> out>, func=<optimized out>, l=<optimized out>, lbl=<optimized out>) at 
> include/common/hathreads.h:725
> #3  pendconn_redistribute (s=0x555555786d80) at src/queue.c:411
> #4  0x00005555555e7842 in srv_update_status () at src/server.c:4680
> #5  0x00005555555e89a1 in srv_set_stopped.part () at src/server.c:966
> #6  0x00005555555e8bd1 in srv_set_stopped (s=<optimized out>, 
> reason=<optimized out>, check=<optimized out>) at src/server.c:948
> #7  0x0000555555639358 in process_chk_conn (state=<optimized out>, 
> context=0x5555557871d0, t=0x555555783290) at src/checks.c:2265
> #8  process_chk () at src/checks.c:2304
> #9  0x00005555556a2293 in process_runnable_tasks () at src/task.c:384
> #10 0x00005555556408b9 in run_poll_loop () at src/haproxy.c:2386
> #11 run_thread_poll_loop () at src/haproxy.c:2451
> #12 0x0000555555581e4b in main () at src/haproxy.c:3053
> #13 0x00007ffff702ab17 in __libc_start_main (main=0x555555580e50 <main>, 
> argc=3, argv=0x7fffffffdfa8, init=<optimized out>, fini=<optimized out>, 
> rtld_fini=<optimized out>, stack_end=0x7fffffffdf98) at 
> ../csu/libc-start.c:310
> #14 0x0000555555583d0a in _start ()

Thank you for this trace. I'm currently debugging a very long chain of
insufficient locking in the server check code. Most operations from the
CLI simply take no lock, and some status update functions assume they
are locked even though they can be called from various places. This
remained partially hidden while these functions were called
asynchronously from the rendez-vous point, but now that they are called
synchronously, everything breaks.
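
To illustrate the pattern (only a sketch with made-up names, not our
code): a helper takes a lock unconditionally while one of its callers
already holds it, so a debug build that tracks lock ownership aborts
right inside the lock function, which is exactly the shape of the
abort seen in __spin_lock above:

#include <assert.h>
#include <pthread.h>

static pthread_mutex_t server_lock = PTHREAD_MUTEX_INITIALIZER;
static __thread int i_hold_lock;     /* debug-only ownership flag */

static void debug_lock(void)
{
    assert(!i_hold_lock);            /* debug build aborts on a re-lock */
    pthread_mutex_lock(&server_lock);
    i_hold_lock = 1;
}

static void debug_unlock(void)
{
    i_hold_lock = 0;
    pthread_mutex_unlock(&server_lock);
}

/* helper written as if it had to take the lock itself */
static void redistribute_pending(void)
{
    debug_lock();
    /* ... requeue pending connections on other servers ... */
    debug_unlock();
}

/* one caller path already holds the lock when it calls the helper */
static void update_status(void)
{
    debug_lock();
    redistribute_pending();          /* re-lock: silent deadlock, or an
                                        abort() once the debug check fires */
    debug_unlock();
}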

I have good hopes of sorting this out, starting by placing comments at
the top of the functions to document their locking expectations (see
the sketch below) and to flush out all the wrong assumptions. If I
can't sort it out this way, I can easily revert the patch that made
everything synchronous, go back to the asynchronous mode, and hide all
the dust under the carpet. But as you know, I really hate doing that,
so I prefer to face the problem head-on for now.
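
The comments will look something like this (a made-up function, just
to show the shape of the contract I want each function to state):

/* Requeue the connections pending on server <s> onto other servers.
 * Locking: the caller must already hold <s>'s lock; this function
 * takes and releases the proxy's queue lock itself. May be called
 * from the checks task or from the CLI.
 */
static void example_requeue(struct server *s);

Once every function states its contract this way, the callers that
violate it become obvious.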

Cheers,
Willy
