Hi again,

On Thu, Apr 11, 2019 at 11:53:28AM +0200, Willy Tarreau wrote:
> > Got multiple incidents of failure with 1.9.6:
> > Core was generated by `/usr/sbin/haproxy -Ws -f /etc/haproxy/haproxy.cfg -p /var/run/haproxy'.
> > Program terminated with signal SIGFPE, Arithmetic exception.
> > #0  0x0000559afb73c533 in fwrr_update_position (grp=0x559afbd9fb68, grp=0x559afbd9fb68, s=0x559afcc5f560) at src/lb_fwrr.c:498
> > 498 HA_ATOMIC_ADD(&s->npos, (grp->next_weight / s->cur_eweight));
> > [Current thread is 1 (Thread 0x7f879677c700 (LWP 776412))]
> > (gdb) thread apply all bt
> 
> Scary, that's not supposed to be possible in theory :
> 
>   /* Computes next position of server <s> in the group. It is mandatory for <s>
>    * to have a non-zero, positive eweight.
>                ^^^^^^^^^
>    *
>    * The server's lock and the lbprm's lock must be held.
>    */
>   static inline void fwrr_update_position(struct fwrr_group *grp, struct server *s)
> 
> So either we're doing something wrong somewhere in a caller, or we have
> insufficient locking and sometimes this server's weight is put down to
> zero between the moment the value is checked and the moment it's used.
> 
> I'm having a look at it right now.

I tried to follow all paths that lead to a zero cur_eweight that I could
find and none of them leave the server in the tree. Then I tried to find
all cases where this entry is updated or used and all are under the server
lock, meaning that I don't see how another thread could have changed the
value between the check and the use. I must obviously be wrong about at
least one of them, but I really can't figure out which one. I guess the core
will probably help a bit if you still have it somewhere.
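
For illustration only, the invariant described above (the weight is checked
and then used as a divisor while the same lock is held) can be sketched as
follows. This is not the code from src/lb_fwrr.c: demo_server,
demo_update_position and the pthread mutex are hypothetical stand-ins for the
real server structure and its lock. It only shows why a zero divisor should
be impossible as long as every writer takes the same lock.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-in for a server entry; not HAProxy's struct server. */
struct demo_server {
    pthread_mutex_t lock;      /* protects the two fields below */
    unsigned int cur_eweight;  /* effective weight, 0 means "out of the tree" */
    unsigned int npos;         /* next position in the group */
};

/*
 * Advances <s>'s position by next_weight / cur_eweight.
 * Returns 0 on success, -1 if the weight is zero and the update was skipped.
 * The check and the division both happen under s->lock, so a writer that
 * also takes s->lock cannot zero the weight between the two.
 */
static int demo_update_position(struct demo_server *s, unsigned int next_weight)
{
    unsigned int w;
    int ret = -1;

    assert(s != NULL);

    pthread_mutex_lock(&s->lock);
    w = s->cur_eweight;            /* read once, under the lock */
    if (w != 0) {
        s->npos += next_weight / w;
        ret = 0;
    }
    pthread_mutex_unlock(&s->lock);
    return ret;
}
```

If any path updates the weight without taking that lock, or the divisor is
re-read after the check, the guarantee no longer holds and a SIGFPE like the
one in the trace becomes possible.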

Thanks,
Willy
