Hi William,

On Fri, May 03, 2019 at 09:33:29AM +0000, William Dauchy wrote:
> > Note that with all the scheduling issues we've fixed over the last
> > days, there are multiple candidates which could cause this. Another
> > one was the lack of effect of the nice parameter which is normally
> > set on the CLI but the lack of which could result in socat timing
> > out during the first half second in absence of any response.
>
> we got a similar issue with last v1.9.7+HEAD
> (last commit
> http://git.haproxy.org/?p=haproxy-1.9.git;a=commit;h=f3c64c69b1a293ae54db359a2b2a5f9e0c5265dd)

At first I thought you had hit another deadlock that I couldn't spot,
since nearly all threads were waiting on the LB lock and I couldn't see
how this could happen. But I had missed this thread, which is the most
important one:

> Thread 15 (Thread 0x7fe9b6631700 (LWP 2808)):
> #0 0x000056153d96d7a0 in __eb_insert_dup (new=0x56157f52f424,
> sub=0x56157f5640a4) at ebtree/ebtree.h:478
> #1 eb_insert_dup (sub=<optimized out>, new=0x56157f52f424) at
> ebtree/ebtree.c:31
> #2 0x000056153d96df10 in __eb32_insert (new=new@entry=0x56157f52f424,
> root=<optimized out>, root@entry=0x56157deb4140) at ebtree/eb32tree.h:337
> #3 eb32_insert (root=root@entry=0x56157deb4140,
> new=new@entry=0x56157f52f424) at ebtree/eb32tree.c:27
> #4 0x000056153d957fcb in fwrr_queue_srv (s=s@entry=0x56157f52f080) at
> src/lb_fwrr.c:371
> #5 0x000056153d9585e8 in fwrr_update_server_weight (srv=0x56157f52f080) at
> src/lb_fwrr.c:242
> #6 0x000056153d8ae8ac in srv_update_status (s=0x56157f52f080) at
> src/server.c:4923
> #7 0x000056153d8adfc2 in server_recalc_eweight (sv=sv@entry=0x56157f52f080,
> must_update=must_update@entry=1) at src/server.c:1310
> #8 0x000056153d8b6edd in server_warmup (t=0x5615899be8a0,
> context=0x56157f52f080, state=<optimized out>) at src/checks.c:1492
> #9 0x000056153d94d97a in process_runnable_tasks () at src/task.c:390
> #10 0x000056153d8c5c4f in run_poll_loop () at src/haproxy.c:2661
> #11 run_thread_poll_loop (data=<optimized out>) at src/haproxy.c:2726
> #12 0x00007fe9bd5e7dd5 in start_thread () from /lib64/libpthread.so.0
> #13 0x00007fe9bc320ead in clone () from /lib64/libc.so.6

So I conclude that it crashed, and that all the other threads simply met
at the same lock while the core was being dumped in this one. I figured
out what was wrong: the server_warmup() function has been missing a lock
since 1.8. I've just fixed this and backported it to 1.9.
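
To illustrate the class of bug (this is not the actual patch, only a
minimal sketch under my own assumptions): a periodic task updates a
shared ordered structure, and every such update has to take the same
lock as the other writers. All names below (lb_queue, warmup_update,
run_demo) are hypothetical, a sorted linked list stands in for the
ebtree, and a plain pthread mutex stands in for the LB lock.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical stand-in for a server entry queued in the LB tree. */
struct node {
    struct node *next;
    unsigned int key;
};

/* Hypothetical shared queue; the mutex plays the role of the LB lock. */
struct lb_queue {
    pthread_mutex_t lock;
    struct node *head;      /* sorted list, stand-in for the ebtree */
    unsigned long count;
};

/* Sorted insertion. The caller MUST hold q->lock. */
static void queue_insert(struct lb_queue *q, struct node *n)
{
    struct node **pp = &q->head;

    while (*pp && (*pp)->key <= n->key)
        pp = &(*pp)->next;
    n->next = *pp;
    *pp = n;
    q->count++;
}

/* Warmup-like update: dropping the lock/unlock pair here would let two
 * threads rewrite the same links at once and corrupt the list. */
static void warmup_update(struct lb_queue *q, struct node *n)
{
    pthread_mutex_lock(&q->lock);
    queue_insert(q, n);
    pthread_mutex_unlock(&q->lock);
}

struct worker_arg {
    struct lb_queue *q;
    struct node *nodes;
    unsigned int n;
    unsigned int seed;
};

static void *worker(void *p)
{
    struct worker_arg *a = p;
    unsigned int i;

    for (i = 0; i < a->n; i++) {
        a->nodes[i].key = (a->seed * 2654435761u + i * 40503u) % 1000;
        warmup_update(a->q, &a->nodes[i]);
    }
    return NULL;
}

/* Runs nthreads concurrent writers; returns the number of queued nodes
 * if the list is complete and sorted, or -1 otherwise. */
long run_demo(unsigned int nthreads, unsigned int per_thread)
{
    struct lb_queue q;
    pthread_t *tids;
    struct worker_arg *args;
    struct node *nodes, *cur;
    unsigned int i, started = 0;
    long walked = 0, result = -1;
    int sorted = 1;

    assert(nthreads > 0 && per_thread > 0);
    q.head = NULL;
    q.count = 0;
    pthread_mutex_init(&q.lock, NULL);

    tids = calloc(nthreads, sizeof(*tids));
    args = calloc(nthreads, sizeof(*args));
    nodes = calloc((size_t)nthreads * per_thread, sizeof(*nodes));
    if (tids && args && nodes) {
        for (i = 0; i < nthreads; i++) {
            args[i].q = &q;
            args[i].nodes = nodes + (size_t)i * per_thread;
            args[i].n = per_thread;
            args[i].seed = i + 1;
            if (pthread_create(&tids[i], NULL, worker, &args[i]) != 0)
                break;
            started++;
        }
        for (i = 0; i < started; i++)
            pthread_join(tids[i], NULL);

        /* Single-threaded again: check ordering and element count. */
        for (cur = q.head; cur; cur = cur->next) {
            if (cur->next && cur->next->key < cur->key)
                sorted = 0;
            walked++;
        }
        if (started == nthreads && sorted && (unsigned long)walked == q.count)
            result = walked;
    }
    free(nodes);
    free(args);
    free(tids);
    pthread_mutex_destroy(&q.lock);
    return result;
}
```

Built with -pthread and called from a small main(), run_demo(4, 500)
inserts 2000 nodes from four threads and checks that the list is still
sorted. Without the lock in warmup_update() the result becomes
unpredictable, which matches the "high concurrency and bad luck" nature
of such crashes.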

I would be grateful if you could test it again, as I failed to reproduce
the issue (it requires high concurrency and bad luck, as is often the
case with such bugs).

Thanks!
Willy