On 6/20/16 9:39 AM, Gleb Smirnoff wrote:
> On Fri, Jun 17, 2016 at 11:27:39AM +0200, Julien Charbon wrote:
> J> > Comparing stable/10 and head, I see two changes that could
> J> > affect that:
> J> > 
> J> > - callout_async_drain
> J> > - switch to READ lock for inp info in tcp timers
> J> > 
> J> > That's why you are in To, Julien and Hans :)
> J> > 
> J> > We continue investigating, and I will keep you updated.
> J> > However, any help is welcome. I can share cores.
> Now, spending some time with cores and adding a bunch of
> extra CTRs, I have a sequence of events that lead to the
> panic. In short, the bug is in the callout system. It seems
> to be not relevant to the callout_async_drain, at least for
> now. The transition to READ lock unmasked the problem, that's
> why NetflixBSD 10 doesn't panic.
> The panic requires heavy contention on the TCP info lock.
> [CPU 1] the callout fires, tcp_timer_keep entered
> [CPU 1] blocks on INP_INFO_RLOCK(&V_tcbinfo);
> [CPU 2] schedules the callout
> [CPU 2] tcp_discardcb called
> [CPU 2] callout successfully canceled
> [CPU 2] tcpcb freed
> [CPU 1] unblocks... panic
> When the lock was WLOCK, all contenders were resumed in a
> sequence they came to the lock. Now, that they are readers,
> once the lock is released, readers are resumed in a "random"
> order, and this allows tcp_discardcb to go before the old
> running callout, and this unmasks the panic.

 Highly interesting.  I should be able to reproduce that (will be useful
for testing the corresponding fix).

 Fix proposal:  If callout_async_drain() returns 0 (fail) (instead of 1
(success) here) when the callout cancellation is a success _but_ the
callout is current running, that should fix it.

 For the history:  It comes back to my old callout question:

 Does _callout_stop_safe() is allowed to return 1 (success) even if the
callout is still currently running;  a.k.a. it is not because you
successfully cancelled a callout that the callout is not currently running.

 We did propose a patch to make _callout_stop_safe() returns 0 (fail)
when the callout is currently running:

callout_stop() should return 0 when the callout is currently being
serviced and indeed unstoppable

 But this change impacted too many old code paths and was interesting
only for TCP timers and thus was abandoned.

 My 2 cents.


Attachment: signature.asc
Description: OpenPGP digital signature

Reply via email to