Hi Maksim,

On Thu, Apr 11, 2019 at 02:03:43PM +0200, Willy Tarreau wrote:
> I tried to follow all paths that lead to a zero cur_eweight that I could
> find and none of them leave the server in the tree. Then I tried to find
> all cases where this entry is updated or used and all are under the server
> lock, meaning that I don't see how another thread could have changed the
> value between the check and the use. I must obviously be wrong on at least
> one of them but I really can't figure which one.

Actually I think I found one way to get there with a lock missing. The
impossible case in your trace made me think that since it's very unlikely
that the CPU is faulty (never impossible but extremely rare), another
thread was possibly still doing something behind our back before the crash
happened, and fixed the value again before the dump was done. These are
thus two very quick changes. I don't see what sequence of actions can do
this but I think I want to study one code path that looks suspicious to
me. I need to double-check this tomorrow after some sleep, I'll keep you
informed.

Cheers,
Willy
