Hi William,

On Fri, May 03, 2019 at 09:33:29AM +0000, William Dauchy wrote:
> > Note that with all the scheduling issues we've fixed over the last
> > days, there are multiple candidates which could cause this. Another
> > one was the lack of effect of the nice parameter which is normally
> > set on the CLI but the lack of which could result in socat timing
> > out during the first half second in absence of any response.
> 
> we got a similar issue with last v1.9.7+HEAD
> (last commit 
> http://git.haproxy.org/?p=haproxy-1.9.git;a=commit;h=f3c64c69b1a293ae54db359a2b2a5f9e0c5265dd)

At first I thought you were hitting yet another deadlock that I couldn't
spot, since nearly all threads were waiting on the LB lock and I couldn't
figure out how this could happen. But I had overlooked this thread, which
is the most important one:

> Thread 15 (Thread 0x7fe9b6631700 (LWP 2808)):
> #0  0x000056153d96d7a0 in __eb_insert_dup (new=0x56157f52f424, 
> sub=0x56157f5640a4) at ebtree/ebtree.h:478
> #1  eb_insert_dup (sub=<optimized out>, new=0x56157f52f424) at 
> ebtree/ebtree.c:31
> #2  0x000056153d96df10 in __eb32_insert (new=new@entry=0x56157f52f424, 
> root=<optimized out>, root@entry=0x56157deb4140) at ebtree/eb32tree.h:337
> #3  eb32_insert (root=root@entry=0x56157deb4140, 
> new=new@entry=0x56157f52f424) at ebtree/eb32tree.c:27
> #4  0x000056153d957fcb in fwrr_queue_srv (s=s@entry=0x56157f52f080) at 
> src/lb_fwrr.c:371
> #5  0x000056153d9585e8 in fwrr_update_server_weight (srv=0x56157f52f080) at 
> src/lb_fwrr.c:242
> #6  0x000056153d8ae8ac in srv_update_status (s=0x56157f52f080) at 
> src/server.c:4923
> #7  0x000056153d8adfc2 in server_recalc_eweight (sv=sv@entry=0x56157f52f080, 
> must_update=must_update@entry=1) at src/server.c:1310
> #8  0x000056153d8b6edd in server_warmup (t=0x5615899be8a0, 
> context=0x56157f52f080, state=<optimized out>) at src/checks.c:1492
> #9  0x000056153d94d97a in process_runnable_tasks () at src/task.c:390
> #10 0x000056153d8c5c4f in run_poll_loop () at src/haproxy.c:2661
> #11 run_thread_poll_loop (data=<optimized out>) at src/haproxy.c:2726
> #12 0x00007fe9bd5e7dd5 in start_thread () from /lib64/libpthread.so.0
> #13 0x00007fe9bc320ead in clone () from /lib64/libc.so.6

So I conclude that it crashed in this thread, and that all the other
threads simply piled up on the same lock while the core was being dumped.
I figured out what was wrong: the server_warmup() function has been
missing a lock since 1.8. I've just fixed this and backported the fix to
1.9. I would be grateful if you could test again, as I failed to
reproduce the issue (it requires high concurrency and bad luck, as is
often the case with such bugs).
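
To illustrate the class of bug only (this is a sketch, not the actual
patch): when several threads reposition entries in a shared queue, each
update has to be done under the lock protecting that queue, otherwise two
concurrent updates can corrupt the links. All names below (lb_queue,
queue_requeue, warmup_task, run_warmup_demo) are hypothetical stand-ins
and not HAProxy's real types or functions, and a plain linked list with a
pthread mutex is assumed in place of the ebtree and the LB lock.

```c
#include <pthread.h>
#include <stddef.h>

#define MAX_WORKERS 16

/* Hypothetical stand-ins, not HAProxy's real structures. */
struct node {
    struct node *prev, *next;
};

struct lb_queue {
    pthread_mutex_t lock;   /* plays the role of the LB lock */
    struct node head;       /* sentinel of a circular list */
};

struct worker {
    struct lb_queue *q;
    struct node srv;        /* this worker's "server" entry */
    int updates;
};

/* Unlink <n> and re-insert it at the tail. Must be called with q->lock
 * held (or before any thread is started): it touches neighbours' links.
 */
static void queue_requeue(struct lb_queue *q, struct node *n)
{
    n->prev->next = n->next;
    n->next->prev = n->prev;

    n->prev = q->head.prev;
    n->next = &q->head;
    q->head.prev->next = n;
    q->head.prev = n;
}

/* Periodic "warmup" task: every update is done under the lock.
 * Dropping the lock/unlock pair here is the kind of omission that lets
 * two threads modify the same links at once.
 */
static void *warmup_task(void *arg)
{
    struct worker *w = arg;

    for (int i = 0; i < w->updates; i++) {
        pthread_mutex_lock(&w->q->lock);
        queue_requeue(w->q, &w->srv);
        pthread_mutex_unlock(&w->q->lock);
    }
    return NULL;
}

/* Runs <nworkers> threads doing <updates> requeues each and returns the
 * number of entries found in the queue afterwards, or -1 on error.
 */
int run_warmup_demo(int nworkers, int updates)
{
    struct lb_queue q;
    struct worker w[MAX_WORKERS];
    pthread_t tid[MAX_WORKERS];
    int started = 0, count = 0;

    if (nworkers < 1 || nworkers > MAX_WORKERS)
        return -1;

    pthread_mutex_init(&q.lock, NULL);
    q.head.prev = q.head.next = &q.head;

    for (int i = 0; i < nworkers; i++) {
        w[i].q = &q;
        w[i].updates = updates;
        /* self-linked, so the unlink step of the first requeue is a no-op */
        w[i].srv.prev = w[i].srv.next = &w[i].srv;
        queue_requeue(&q, &w[i].srv);
    }

    for (int i = 0; i < nworkers; i++) {
        if (pthread_create(&tid[i], NULL, warmup_task, &w[i]) != 0)
            break;
        started++;
    }
    for (int i = 0; i < started; i++)
        pthread_join(tid[i], NULL);

    for (struct node *n = q.head.next; n != &q.head; n = n->next)
        count++;

    pthread_mutex_destroy(&q.lock);
    return started == nworkers ? count : -1;
}
```

With the lock in place, run_warmup_demo(4, 10000) should find all 4
entries still queued. Without it, the list may end up corrupted, which
would be consistent with needing high concurrency and bad luck to hit.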

Thanks!
Willy
