Hi Willy!

Actually I don't think this is a CPU fault. The reason is that I have the same
core dumps with non-zero divisors on 4 more hardware servers with different CPU
models. So I agree that activity from another thread is the likely cause. What
these servers have in common is that all of them use haproxy-agent to set the
weights of their backends. Other instances with no haproxy-agent in their
configs don't produce core dumps.
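
To make the suspected scenario concrete, here is a minimal sketch of it. This
is not HAProxy's code: the struct and the demo_* functions are hypothetical
stand-ins, and only the field name cur_eweight is taken from this thread. An
agent-like writer changes the weight and a scheduler-like reader divides by
it, both under the same lock. If either side skipped the lock, the weight
could become zero between the check and the division.

```c
#include <pthread.h>

/* Hypothetical stand-in for a server entry; not HAProxy's real struct. */
struct demo_server {
    pthread_mutex_t lock;
    unsigned int cur_eweight;   /* effective weight, may drop to 0 */
};

/* Agent-like updater: changes the weight while holding the lock. */
static void demo_set_weight(struct demo_server *s, unsigned int w)
{
    pthread_mutex_lock(&s->lock);
    s->cur_eweight = w;
    pthread_mutex_unlock(&s->lock);
}

/* Scheduler-like reader: the check and the division happen under the
 * same lock, so no writer can zero the weight between them. Returns 0
 * for a server with no weight instead of dividing. */
static unsigned int demo_share(struct demo_server *s, unsigned int total)
{
    unsigned int res = 0;

    pthread_mutex_lock(&s->lock);
    if (s->cur_eweight != 0)
        res = total / s->cur_eweight;
    pthread_mutex_unlock(&s->lock);
    return res;
}
```

With a weight of 4, demo_share(&s, 100) gives 25. After demo_set_weight(&s, 0)
it gives 0 rather than crashing.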

On Mon, Apr 15, 2019 at 23:48, Willy Tarreau <[email protected]> wrote:

> Hi Maksim,
>
> On Thu, Apr 11, 2019 at 02:03:43PM +0200, Willy Tarreau wrote:
> > I tried to follow all paths that lead to a zero cur_eweight that I could
> > find and none of them leave the server in the tree. Then I tried to find
> > all cases where this entry is updated or used and all are under the server
> > lock, meaning that I don't see how another thread could have changed the
> > value between the check and the use. I must obviously be wrong on at least
> > one of them but I really can't figure which one.
>
> Actually I think I found one way to get there with a lock missing. The
> impossible case in your trace made me think that since it's very unlikely
> that the CPU is faulty (never impossible but extremely rare), another
> thread was possibly still doing something in our back before the crash
> happened, and fixed the value again before the dump was done. These are
> thus two very quick changes. I don't see what sequence of actions can do
> this but I think I want to study one code path that looks suspicious to
> me. I need to double-check this tomorrow after some sleep, I'll keep you
> informed.
>
> Cheers,
> Willy
>
