Hi, On 6/20/16 11:55 AM, Julien Charbon wrote: > On 6/20/16 9:39 AM, Gleb Smirnoff wrote: >> On Fri, Jun 17, 2016 at 11:27:39AM +0200, Julien Charbon wrote: >> J> > Comparing stable/10 and head, I see two changes that could >> J> > affect that: >> J> > >> J> > - callout_async_drain >> J> > - switch to READ lock for inp info in tcp timers >> J> > >> J> > That's why you are in To, Julien and Hans :) >> J> > >> J> > We continue investigating, and I will keep you updated. >> J> > However, any help is welcome. I can share cores. >> >> Now, spending some time with cores and adding a bunch of >> extra CTRs, I have a sequence of events that lead to the >> panic. In short, the bug is in the callout system. It seems >> to be not relevant to the callout_async_drain, at least for >> now. The transition to READ lock unmasked the problem, that's >> why NetflixBSD 10 doesn't panic. >> >> The panic requires heavy contention on the TCP info lock. >> >> [CPU 1] the callout fires, tcp_timer_keep entered >> [CPU 1] blocks on INP_INFO_RLOCK(&V_tcbinfo); >> [CPU 2] schedules the callout >> [CPU 2] tcp_discardcb called >> [CPU 2] callout successfully canceled >> [CPU 2] tcpcb freed >> [CPU 1] unblocks... panic >> >> When the lock was WLOCK, all contenders were resumed in a >> sequence they came to the lock. Now, that they are readers, >> once the lock is released, readers are resumed in a "random" >> order, and this allows tcp_discardcb to go before the old >> running callout, and this unmasks the panic. > > Highly interesting. I should be able to reproduce that (will be useful > for testing the corresponding fix).
Finally, I was able to reproduce it (without glebius fix). The trick was to really lower TCP keep timer expiration: $ sysctl -a | grep tcp.keep net.inet.tcp.keepidle: 7200000 net.inet.tcp.keepintvl: 75000 net.inet.tcp.keepinit: 75000 net.inet.tcp.keepcnt: 8 $ sudo bash -c "sysctl net.inet.tcp.keepidle=10 && sysctl net.inet.tcp.keepintvl=50 && sysctl net.inet.tcp.keepinit=10" Password: net.inet.tcp.keepidle: 7200000 -> 10 net.inet.tcp.keepintvl: 75000 -> 50 net.inet.tcp.keepinit: 75000 -> 10 Note: It will certainly close all your ssh connections to the tested server. Now I will test in order: #1. glebius fix https://svnweb.freebsd.org/base?view=revision&revision=302350 #2. rss extra fix https://reviews.freebsd.org/D7135 #3. rrs TCP Timer cleanup https://reviews.freebsd.org/D7136 My panic for reference: Fatal trap 9: general protection fault while in kernel mode cpuid = 10; apic id = 28 [root@atlas-dl360-4 ~]# instruction pointer = 0x20:0xffffffff80c346f1 stack pointer = 0x28:0xfffffe1f29b848b0 frame pointer = 0x28:0xfffffe1f29b848e0 code segment = base 0x0, limit 0xfffff, type 0x1b = DPL 0, pres 1, long 1, def32 0, gran 1 processor eflags = interrupt enabled, resume, IOPL = 0 current process = 12 (swi4: clock (4)) trap number = 9 panic: general protection fault cpuid = 10 KDB: stack backtrace: db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe1f29b844a0 vpanic() at vpanic+0x182/frame 0xfffffe1f29b84520 panic() at panic+0x43/frame 0xfffffe1f29b84580 trap_fatal() at trap_fatal+0x351/frame 0xfffffe1f29b845e0 trap() at trap+0x820/frame 0xfffffe1f29b847f0 calltrap() at calltrap+0x8/frame 0xfffffe1f29b847f0 --- trap 0x9, rip = 0xffffffff80c346f1, rsp = 0xfffffe1f29b848c0, rbp = 0xfffffe1f29b848e0 --- tcp_timer_keep() at tcp_timer_keep+0x51/frame 0xfffffe1f29b848e0 softclock_call_cc() at softclock_call_cc+0x19c/frame 0xfffffe1f29b849c0 softclock() at softclock+0x47/frame 0xfffffe1f29b849e0 intr_event_execute_handlers() at intr_event_execute_handlers+0x96/frame 0xfffffe1f29b84a20 ithread_loop() at ithread_loop+0xa6/frame 0xfffffe1f29b84a70 fork_exit() at fork_exit+0x84/frame 0xfffffe1f29b84ab0 fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe1f29b84ab0 --- trap 0, rip = 0, rsp = 0, rbp = 0 --- -- Julien
Description: OpenPGP digital signature