On 07/21/16 09:54, Julien Charbon wrote:
On 7/14/16 11:02 PM, Larry Rosenman wrote:
On 2016-07-14 12:01, Julien Charbon wrote:
On 6/20/16 11:55 AM, Julien Charbon wrote:
On 6/20/16 9:39 AM, Gleb Smirnoff wrote:
On Fri, Jun 17, 2016 at 11:27:39AM +0200, Julien Charbon wrote:
J> > Comparing stable/10 and head, I see two changes that could
J> > affect that:
J> > - callout_async_drain
J> > - switch to READ lock for inp info in tcp timers
J> > That's why you are in To, Julien and Hans :)
J> > We continue investigating, and I will keep you updated.
J> > However, any help is welcome. I can share cores.
Now, after spending some time with cores and adding a bunch of
extra CTRs, I have a sequence of events that leads to the
panic. In short, the bug is in the callout system. It does not
seem to be related to callout_async_drain, at least for
now. The transition to the READ lock unmasked the problem, which
is why NetflixBSD 10 doesn't panic.
The panic requires heavy contention on the TCP info lock.
[CPU 1] the callout fires, tcp_timer_keep entered
[CPU 1] blocks on INP_INFO_RLOCK(&V_tcbinfo);
[CPU 2] schedules the callout
[CPU 2] tcp_discardcb called
[CPU 2] callout successfully canceled
[CPU 2] tcpcb freed
[CPU 1] unblocks... panic
When the lock was a WLOCK, all contenders were resumed in the
order they arrived at the lock. Now that they are readers,
once the lock is released, readers are resumed in a "random"
order. This allows tcp_discardcb to run before the old
running callout, which unmasks the panic.
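
The sequence above can be replayed in a small, single-threaded toy model. This is only a sketch of my reading of the description, not the FreeBSD callout(9) code: every toy_* name is hypothetical, and the assumption that the cancel reports success because it removed the freshly rescheduled instance while the old handler is still running is mine, not something stated in the thread.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: hypothetical types, NOT the FreeBSD callout(9) code. */
struct toy_callout {
	bool pending;	/* armed, handler not started yet */
	bool running;	/* handler started and has not returned */
};

struct toy_tcpcb {
	bool freed;
	struct toy_callout keep;
};

/* The callout fires: the handler starts and will block on the lock. */
static void
toy_fire(struct toy_callout *c)
{
	assert(c->pending);
	c->pending = false;
	c->running = true;
}

/* Arm the callout again while the old handler is still running. */
static void
toy_schedule(struct toy_callout *c)
{
	c->pending = true;
}

/* Assumed buggy behavior: removing a pending instance counts as success. */
bool
toy_stop_buggy(struct toy_callout *c)
{
	if (c->pending) {
		c->pending = false;
		return (true);
	}
	return (!c->running);
}

/* Careful variant: success only if no handler is still running. */
bool
toy_stop_careful(struct toy_callout *c)
{
	c->pending = false;
	return (!c->running);
}

/* Stand-in for tcp_discardcb: free the tcpcb only if the stop succeeded. */
static void
toy_discard(struct toy_tcpcb *tp, bool (*stop)(struct toy_callout *))
{
	if (stop(&tp->keep))
		tp->freed = true;
}

/* Replay the event list; true means CPU 1 would resume on a freed tcpcb. */
bool
toy_replay(bool (*stop)(struct toy_callout *))
{
	struct toy_tcpcb tp = { .freed = false,
	    .keep = { .pending = true, .running = false } };

	toy_fire(&tp.keep);	/* [CPU 1] callout fires, blocks on lock */
	toy_schedule(&tp.keep);	/* [CPU 2] schedules the callout */
	toy_discard(&tp, stop);	/* [CPU 2] cancel, then maybe free */
	return (tp.freed && tp.keep.running);	/* [CPU 1] unblocks */
}
```

With toy_stop_buggy the replay ends with the handler still running on a freed tcpcb, which corresponds to the panic. With toy_stop_careful the tcpcb is not freed at that point, so the free would have to be deferred until the handler returns; the model does not show that part.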
Highly interesting. I should be able to reproduce that (will be useful
for testing the corresponding fix).
Finally, I was able to reproduce it (without the glebius fix). The trick
was to set the TCP keep timer expirations very low:
$ sysctl -a | grep tcp.keep
$ sudo bash -c "sysctl net.inet.tcp.keepidle=10 && sysctl net.inet.tcp.keepintvl=50 && sysctl net.inet.tcp.keepinit=10"
net.inet.tcp.keepidle: 7200000 -> 10
net.inet.tcp.keepintvl: 75000 -> 50
net.inet.tcp.keepinit: 75000 -> 10
Note: It will certainly close all your ssh connections to the tested
Now I will test in order:
#1. glebius fix
#2. rrs extra fix
#3. rrs TCP Timer cleanup
please see also https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=210884
My test results so far:
#1. r302350: First glebius TCP timer fix: No more TCP timer kernel
panic during 48h under a load of 200k TCP queries per second.
Sadly I was unable to reproduce the issue described here:
panic: bogus refcnt 0 on lle 0xfffff80004608c00
#2. r303098: Got all kernel callout changes since r302350, (updates on
callout code are indeed always full of surprises):
No kernel panic either.
Still to test:
#3. rrs extra fix (if still relevant now)
#4. rrs TCP Timer cleanup:
My 2 cents.
You should also check for memory leaks using "vmstat -m".