Hi Willy!

Actually, I don't think this is a CPU fault. The reason is that I have the same cores with non-zero divisors on 4 more hardware servers with different CPU models. So I agree that another thread's activity is the more likely cause. The one thing these servers have in common is that all of them use haproxy-agent to set the weights of their backends. Other instances with no haproxy-agent in their configs don't produce cores.
On Mon, Apr 15, 2019 at 23:48, Willy Tarreau <[email protected]> wrote:

> Hi Maksim,
>
> On Thu, Apr 11, 2019 at 02:03:43PM +0200, Willy Tarreau wrote:
> > I tried to follow all paths that lead to a zero cur_eweight that I could
> > find and none of them leave the server in the tree. Then I tried to find
> > all cases where this entry is updated or used and all are under the server
> > lock, meaning that I don't see how another thread could have changed the
> > value between the check and the use. I must obviously be wrong on at least
> > one of them but I really can't figure which one.
>
> Actually I think I found one way to get there with a lock missing. The
> impossible case in your trace made me think that since it's very unlikely
> that the CPU is faulty (never impossible but extremely rare), another
> thread was possibly still doing something in our back before the crash
> happened, and fixed the value again before the dump was done. These are
> thus two very quick changes. I don't see what sequence of actions can do
> this but I think I want to study one code path that looks suspicious to
> me. I need to double-check this tomorrow after some sleep, I'll keep you
> informed.
>
> Cheers,
> Willy
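
For illustration only, the "check, then use" window discussed in the quoted message can be sketched in C roughly as follows. This is a minimal sketch under stated assumptions, not HAProxy's actual code: `struct demo_server`, `share_locked()` and `set_weight()` are hypothetical names, and only the field name `cur_eweight` comes from the thread. The point is that the zero check and the division must both happen under the same lock that every writer takes; if any single writer path skips the lock, the weight can become zero between the check and the division.

```c
#include <pthread.h>

/* Hypothetical stand-in for a server entry; NOT HAProxy's struct server. */
struct demo_server {
    pthread_mutex_t lock;       /* protects cur_eweight */
    unsigned int cur_eweight;   /* effective weight; another thread may set it to 0 */
};

/* Reader: the zero check and the division are done under one lock hold,
 * so no writer that also takes the lock can slip in between them.
 * Returns -1 when the weight is zero instead of dividing. */
static int share_locked(struct demo_server *s, unsigned int total)
{
    int ret = -1;

    pthread_mutex_lock(&s->lock);
    if (s->cur_eweight != 0)
        ret = (int)(total / s->cur_eweight);
    pthread_mutex_unlock(&s->lock);
    return ret;
}

/* Writer: assumed to model a weight update (for example one triggered by an
 * agent report). If a path like this one omitted the lock, the reader's
 * check above would no longer protect the division. */
static void set_weight(struct demo_server *s, unsigned int w)
{
    pthread_mutex_lock(&s->lock);
    s->cur_eweight = w;
    pthread_mutex_unlock(&s->lock);
}
```

With a server initialised as `{ PTHREAD_MUTEX_INITIALIZER, 4 }`, `share_locked(&s, 100)` divides normally, and after `set_weight(&s, 0)` it returns -1 rather than dividing by zero.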

