Hi again,

On Thu, Apr 11, 2019 at 11:53:28AM +0200, Willy Tarreau wrote:
> > Got multiple incidents of failure with 1.9.6:
> > Core was generated by `/usr/sbin/haproxy -Ws -f /etc/haproxy/haproxy.cfg -p
> > /var/run/haproxy'.
> > Program terminated with signal SIGFPE, Arithmetic exception.
> > #0 0x0000559afb73c533 in fwrr_update_position (grp=0x559afbd9fb68,
> > grp=0x559afbd9fb68, s=0x559afcc5f560) at src/lb_fwrr.c:498
> > 498 HA_ATOMIC_ADD(&s->npos, (grp->next_weight / s->cur_eweight));
> > [Current thread is 1 (Thread 0x7f879677c700 (LWP 776412))]
> > (gdb) thread apply all bt
>
> Scary, that's not supposed to be possible in theory :
>
>   /* Computes next position of server <s> in the group. It is mandatory for <s>
>    * to have a non-zero, positive eweight.
>                 ^^^^^^^^^
>    *
>    * The server's lock and the lbprm's lock must be held.
>    */
>   static inline void fwrr_update_position(struct fwrr_group *grp, struct server *s)
>
> So either we're doing something wrong somewhere in a caller, or we have
> insufficient locking and sometimes this server's weight is put down to
> zero between the moment the value is checked and the moment it's used.
>
> I'm having a look at it right now.

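
For reference, the second hypothesis above boils down to the pattern below. This is only a simplified sketch and not the actual lb_fwrr.c code: the demo_* structures and functions are hypothetical stand-ins, and a plain pthread mutex replaces haproxy's own locking. It shows the weight being checked and used under the same lock, so that no other thread taking that lock can zero it in between, and a zero weight being reported instead of reaching the division (an integer division by zero is what raises SIGFPE here):

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical, simplified stand-ins for the real server/group structures. */
struct demo_server {
    pthread_mutex_t lock;       /* protects cur_eweight and npos */
    unsigned int cur_eweight;   /* effective weight, may drop to 0 */
    unsigned int npos;          /* next position in the group */
};

struct demo_group {
    unsigned int next_weight;
};

/* Advances the server's position. The zero check and the division both
 * happen while the lock is held, so a writer that also takes the lock
 * cannot change cur_eweight between the two.
 * Returns 0 on success, -1 if the weight is zero (no division is done).
 */
static int demo_update_position(struct demo_group *grp, struct demo_server *s)
{
    int ret = -1;

    pthread_mutex_lock(&s->lock);
    if (s->cur_eweight != 0) {
        s->npos += grp->next_weight / s->cur_eweight;
        ret = 0;
    }
    pthread_mutex_unlock(&s->lock);
    return ret;
}

/* Small usage check: a non-zero weight advances npos, a zero weight
 * is rejected and leaves npos untouched.
 */
static void demo_usage(void)
{
    struct demo_server s = { .cur_eweight = 4, .npos = 0 };
    struct demo_group g = { .next_weight = 12 };

    pthread_mutex_init(&s.lock, NULL);
    assert(demo_update_position(&g, &s) == 0);
    assert(s.npos == 3);
    s.cur_eweight = 0;
    assert(demo_update_position(&g, &s) == -1);
    assert(s.npos == 3);
    pthread_mutex_destroy(&s.lock);
}
```

If any writer updates the weight without taking that same lock, the check above protects nothing, which is the kind of path I was looking for.
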
I tried to follow all the paths I could find that lead to a zero
cur_eweight, and none of them leaves the server in the tree. Then I tried
to find all the places where this entry is updated or used, and all of
them are under the server lock, meaning that I don't see how another
thread could have changed the value between the check and the use.

I must obviously be wrong on at least one of them but I really can't
figure out which one. I guess the core will probably help a little bit if
you still have it somewhere.

Thanks,
Willy

