Re: panic with tcp timers
On 07/21/16 09:54, Julien Charbon wrote: Hi, On 7/14/16 11:02 PM, Larry Rosenman wrote: On 2016-07-14 12:01, Julien Charbon wrote: On 6/20/16 11:55 AM, Julien Charbon wrote: On 6/20/16 9:39 AM, Gleb Smirnoff wrote: On Fri, Jun 17, 2016 at 11:27:39AM +0200, Julien Charbon wrote: J> > Comparing stable/10 and head, I see two changes that could J> > affect that: J> > J> > - callout_async_drain J> > - switch to READ lock for inp info in tcp timers J> > J> > That's why you are in To, Julien and Hans :) J> > J> > We continue investigating, and I will keep you updated. J> > However, any help is welcome. I can share cores. Now, spending some time with cores and adding a bunch of extra CTRs, I have a sequence of events that lead to the panic. In short, the bug is in the callout system. It seems to be not relevant to the callout_async_drain, at least for now. The transition to READ lock unmasked the problem, that's why NetflixBSD 10 doesn't panic. The panic requires heavy contention on the TCP info lock. [CPU 1] the callout fires, tcp_timer_keep entered [CPU 1] blocks on INP_INFO_RLOCK(_tcbinfo); [CPU 2] schedules the callout [CPU 2] tcp_discardcb called [CPU 2] callout successfully canceled [CPU 2] tcpcb freed [CPU 1] unblocks... panic When the lock was WLOCK, all contenders were resumed in a sequence they came to the lock. Now, that they are readers, once the lock is released, readers are resumed in a "random" order, and this allows tcp_discardcb to go before the old running callout, and this unmasks the panic. Highly interesting. I should be able to reproduce that (will be useful for testing the corresponding fix). Finally, I was able to reproduce it (without glebius fix). The trick was to really lower TCP keep timer expiration: $ sysctl -a | grep tcp.keep net.inet.tcp.keepidle: 720 net.inet.tcp.keepintvl: 75000 net.inet.tcp.keepinit: 75000 net.inet.tcp.keepcnt: 8 $ sudo bash -c "sysctl net.inet.tcp.keepidle=10 && sysctl net.inet.tcp.keepintvl=50 && sysctl net.inet.tcp.keepinit=10" Password: net.inet.tcp.keepidle: 720 -> 10 net.inet.tcp.keepintvl: 75000 -> 50 net.inet.tcp.keepinit: 75000 -> 10 Note: It will certainly close all your ssh connections to the tested server. Now I will test in order: #1. glebius fix https://svnweb.freebsd.org/base?view=revision=302350 #2. rss extra fix https://reviews.freebsd.org/D7135 #3. rrs TCP Timer cleanup https://reviews.freebsd.org/D7136 please see also https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=210884 My tests result so far: #1. r302350: First glebius TCP timer fix: No more TCP timer kernel panic during 48h under 200k TCP query per second load. Sadly I was unable to reproduce the issue described here: panic: bogus refcnt 0 on lle 0xf80004608c00 https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=210884 #2. r303098: Got all kernel callout changes since r302350, (updates on callout code are indeed always full of surprises): https://svnweb.freebsd.org/base/head/sys/kern/kern_timeout.c?view=log=303098 No kernel panic either. Still to test: #3. rss extra fix (if still relevant now) https://reviews.freebsd.org/D7135 #4. rrs TCP Timer cleanup: https://reviews.freebsd.org/D7136 My 2 cents. Hi, You should also check for memory leaks using "vmstat -m" . --HPS ___ freebsd-current@freebsd.org mailing list https://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"
Re: panic with tcp timers
Hi, On 7/14/16 11:02 PM, Larry Rosenman wrote: > On 2016-07-14 12:01, Julien Charbon wrote: >> On 6/20/16 11:55 AM, Julien Charbon wrote: >>> On 6/20/16 9:39 AM, Gleb Smirnoff wrote: On Fri, Jun 17, 2016 at 11:27:39AM +0200, Julien Charbon wrote: J> > Comparing stable/10 and head, I see two changes that could J> > affect that: J> > J> > - callout_async_drain J> > - switch to READ lock for inp info in tcp timers J> > J> > That's why you are in To, Julien and Hans :) J> > J> > We continue investigating, and I will keep you updated. J> > However, any help is welcome. I can share cores. Now, spending some time with cores and adding a bunch of extra CTRs, I have a sequence of events that lead to the panic. In short, the bug is in the callout system. It seems to be not relevant to the callout_async_drain, at least for now. The transition to READ lock unmasked the problem, that's why NetflixBSD 10 doesn't panic. The panic requires heavy contention on the TCP info lock. [CPU 1] the callout fires, tcp_timer_keep entered [CPU 1] blocks on INP_INFO_RLOCK(_tcbinfo); [CPU 2] schedules the callout [CPU 2] tcp_discardcb called [CPU 2] callout successfully canceled [CPU 2] tcpcb freed [CPU 1] unblocks... panic When the lock was WLOCK, all contenders were resumed in a sequence they came to the lock. Now, that they are readers, once the lock is released, readers are resumed in a "random" order, and this allows tcp_discardcb to go before the old running callout, and this unmasks the panic. >>> >>> Highly interesting. I should be able to reproduce that (will be useful >>> for testing the corresponding fix). >> >> Finally, I was able to reproduce it (without glebius fix). The trick >> was to really lower TCP keep timer expiration: >> >> $ sysctl -a | grep tcp.keep >> net.inet.tcp.keepidle: 720 >> net.inet.tcp.keepintvl: 75000 >> net.inet.tcp.keepinit: 75000 >> net.inet.tcp.keepcnt: 8 >> $ sudo bash -c "sysctl net.inet.tcp.keepidle=10 && sysctl >> net.inet.tcp.keepintvl=50 && sysctl net.inet.tcp.keepinit=10" >> Password: >> net.inet.tcp.keepidle: 720 -> 10 >> net.inet.tcp.keepintvl: 75000 -> 50 >> net.inet.tcp.keepinit: 75000 -> 10 >> >> Note: It will certainly close all your ssh connections to the tested >> server. >> >> Now I will test in order: >> >> #1. glebius fix >> https://svnweb.freebsd.org/base?view=revision=302350 >> >> #2. rss extra fix >> https://reviews.freebsd.org/D7135 >> >> #3. rrs TCP Timer cleanup >> https://reviews.freebsd.org/D7136 > > please see also https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=210884 My tests result so far: #1. r302350: First glebius TCP timer fix: No more TCP timer kernel panic during 48h under 200k TCP query per second load. Sadly I was unable to reproduce the issue described here: panic: bogus refcnt 0 on lle 0xf80004608c00 https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=210884 #2. r303098: Got all kernel callout changes since r302350, (updates on callout code are indeed always full of surprises): https://svnweb.freebsd.org/base/head/sys/kern/kern_timeout.c?view=log=303098 No kernel panic either. Still to test: #3. rss extra fix (if still relevant now) https://reviews.freebsd.org/D7135 #4. rrs TCP Timer cleanup: https://reviews.freebsd.org/D7136 My 2 cents. -- Julien signature.asc Description: OpenPGP digital signature
Re: panic with tcp timers
On 2016-07-14 12:01, Julien Charbon wrote: Hi, On 6/20/16 11:55 AM, Julien Charbon wrote: On 6/20/16 9:39 AM, Gleb Smirnoff wrote: On Fri, Jun 17, 2016 at 11:27:39AM +0200, Julien Charbon wrote: J> > Comparing stable/10 and head, I see two changes that could J> > affect that: J> > J> > - callout_async_drain J> > - switch to READ lock for inp info in tcp timers J> > J> > That's why you are in To, Julien and Hans :) J> > J> > We continue investigating, and I will keep you updated. J> > However, any help is welcome. I can share cores. Now, spending some time with cores and adding a bunch of extra CTRs, I have a sequence of events that lead to the panic. In short, the bug is in the callout system. It seems to be not relevant to the callout_async_drain, at least for now. The transition to READ lock unmasked the problem, that's why NetflixBSD 10 doesn't panic. The panic requires heavy contention on the TCP info lock. [CPU 1] the callout fires, tcp_timer_keep entered [CPU 1] blocks on INP_INFO_RLOCK(_tcbinfo); [CPU 2] schedules the callout [CPU 2] tcp_discardcb called [CPU 2] callout successfully canceled [CPU 2] tcpcb freed [CPU 1] unblocks... panic When the lock was WLOCK, all contenders were resumed in a sequence they came to the lock. Now, that they are readers, once the lock is released, readers are resumed in a "random" order, and this allows tcp_discardcb to go before the old running callout, and this unmasks the panic. Highly interesting. I should be able to reproduce that (will be useful for testing the corresponding fix). Finally, I was able to reproduce it (without glebius fix). The trick was to really lower TCP keep timer expiration: $ sysctl -a | grep tcp.keep net.inet.tcp.keepidle: 720 net.inet.tcp.keepintvl: 75000 net.inet.tcp.keepinit: 75000 net.inet.tcp.keepcnt: 8 $ sudo bash -c "sysctl net.inet.tcp.keepidle=10 && sysctl net.inet.tcp.keepintvl=50 && sysctl net.inet.tcp.keepinit=10" Password: net.inet.tcp.keepidle: 720 -> 10 net.inet.tcp.keepintvl: 75000 -> 50 net.inet.tcp.keepinit: 75000 -> 10 Note: It will certainly close all your ssh connections to the tested server. Now I will test in order: #1. glebius fix https://svnweb.freebsd.org/base?view=revision=302350 #2. rss extra fix https://reviews.freebsd.org/D7135 #3. rrs TCP Timer cleanup https://reviews.freebsd.org/D7136 My panic for reference: Fatal trap 9: general protection fault while in kernel mode cpuid = 10; apic id = 28 [root@atlas-dl360-4 ~]# instruction pointer = 0x20:0x80c346f1 stack pointer = 0x28:0xfe1f29b848b0 frame pointer = 0x28:0xfe1f29b848e0 code segment= base 0x0, limit 0xf, type 0x1b = DPL 0, pres 1, long 1, def32 0, gran 1 processor eflags= interrupt enabled, resume, IOPL = 0 current process = 12 (swi4: clock (4)) trap number = 9 panic: general protection fault cpuid = 10 KDB: stack backtrace: db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfe1f29b844a0 vpanic() at vpanic+0x182/frame 0xfe1f29b84520 panic() at panic+0x43/frame 0xfe1f29b84580 trap_fatal() at trap_fatal+0x351/frame 0xfe1f29b845e0 trap() at trap+0x820/frame 0xfe1f29b847f0 calltrap() at calltrap+0x8/frame 0xfe1f29b847f0 --- trap 0x9, rip = 0x80c346f1, rsp = 0xfe1f29b848c0, rbp = 0xfe1f29b848e0 --- tcp_timer_keep() at tcp_timer_keep+0x51/frame 0xfe1f29b848e0 softclock_call_cc() at softclock_call_cc+0x19c/frame 0xfe1f29b849c0 softclock() at softclock+0x47/frame 0xfe1f29b849e0 intr_event_execute_handlers() at intr_event_execute_handlers+0x96/frame 0xfe1f29b84a20 ithread_loop() at ithread_loop+0xa6/frame 0xfe1f29b84a70 fork_exit() at fork_exit+0x84/frame 0xfe1f29b84ab0 fork_trampoline() at fork_trampoline+0xe/frame 0xfe1f29b84ab0 --- trap 0, rip = 0, rsp = 0, rbp = 0 --- -- Julien please see also https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=210884 -- Larry Rosenman http://www.lerctr.org/~ler Phone: +1 214-642-9640 E-Mail: l...@lerctr.org US Mail: 17716 Limpia Crk, Round Rock, TX 78664-7281 ___ freebsd-current@freebsd.org mailing list https://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"
Re: panic with tcp timers
Hi, On 6/20/16 11:55 AM, Julien Charbon wrote: > On 6/20/16 9:39 AM, Gleb Smirnoff wrote: >> On Fri, Jun 17, 2016 at 11:27:39AM +0200, Julien Charbon wrote: >> J> > Comparing stable/10 and head, I see two changes that could >> J> > affect that: >> J> > >> J> > - callout_async_drain >> J> > - switch to READ lock for inp info in tcp timers >> J> > >> J> > That's why you are in To, Julien and Hans :) >> J> > >> J> > We continue investigating, and I will keep you updated. >> J> > However, any help is welcome. I can share cores. >> >> Now, spending some time with cores and adding a bunch of >> extra CTRs, I have a sequence of events that lead to the >> panic. In short, the bug is in the callout system. It seems >> to be not relevant to the callout_async_drain, at least for >> now. The transition to READ lock unmasked the problem, that's >> why NetflixBSD 10 doesn't panic. >> >> The panic requires heavy contention on the TCP info lock. >> >> [CPU 1] the callout fires, tcp_timer_keep entered >> [CPU 1] blocks on INP_INFO_RLOCK(_tcbinfo); >> [CPU 2] schedules the callout >> [CPU 2] tcp_discardcb called >> [CPU 2] callout successfully canceled >> [CPU 2] tcpcb freed >> [CPU 1] unblocks... panic >> >> When the lock was WLOCK, all contenders were resumed in a >> sequence they came to the lock. Now, that they are readers, >> once the lock is released, readers are resumed in a "random" >> order, and this allows tcp_discardcb to go before the old >> running callout, and this unmasks the panic. > > Highly interesting. I should be able to reproduce that (will be useful > for testing the corresponding fix). Finally, I was able to reproduce it (without glebius fix). The trick was to really lower TCP keep timer expiration: $ sysctl -a | grep tcp.keep net.inet.tcp.keepidle: 720 net.inet.tcp.keepintvl: 75000 net.inet.tcp.keepinit: 75000 net.inet.tcp.keepcnt: 8 $ sudo bash -c "sysctl net.inet.tcp.keepidle=10 && sysctl net.inet.tcp.keepintvl=50 && sysctl net.inet.tcp.keepinit=10" Password: net.inet.tcp.keepidle: 720 -> 10 net.inet.tcp.keepintvl: 75000 -> 50 net.inet.tcp.keepinit: 75000 -> 10 Note: It will certainly close all your ssh connections to the tested server. Now I will test in order: #1. glebius fix https://svnweb.freebsd.org/base?view=revision=302350 #2. rss extra fix https://reviews.freebsd.org/D7135 #3. rrs TCP Timer cleanup https://reviews.freebsd.org/D7136 My panic for reference: Fatal trap 9: general protection fault while in kernel mode cpuid = 10; apic id = 28 [root@atlas-dl360-4 ~]# instruction pointer = 0x20:0x80c346f1 stack pointer = 0x28:0xfe1f29b848b0 frame pointer = 0x28:0xfe1f29b848e0 code segment= base 0x0, limit 0xf, type 0x1b = DPL 0, pres 1, long 1, def32 0, gran 1 processor eflags= interrupt enabled, resume, IOPL = 0 current process = 12 (swi4: clock (4)) trap number = 9 panic: general protection fault cpuid = 10 KDB: stack backtrace: db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfe1f29b844a0 vpanic() at vpanic+0x182/frame 0xfe1f29b84520 panic() at panic+0x43/frame 0xfe1f29b84580 trap_fatal() at trap_fatal+0x351/frame 0xfe1f29b845e0 trap() at trap+0x820/frame 0xfe1f29b847f0 calltrap() at calltrap+0x8/frame 0xfe1f29b847f0 --- trap 0x9, rip = 0x80c346f1, rsp = 0xfe1f29b848c0, rbp = 0xfe1f29b848e0 --- tcp_timer_keep() at tcp_timer_keep+0x51/frame 0xfe1f29b848e0 softclock_call_cc() at softclock_call_cc+0x19c/frame 0xfe1f29b849c0 softclock() at softclock+0x47/frame 0xfe1f29b849e0 intr_event_execute_handlers() at intr_event_execute_handlers+0x96/frame 0xfe1f29b84a20 ithread_loop() at ithread_loop+0xa6/frame 0xfe1f29b84a70 fork_exit() at fork_exit+0x84/frame 0xfe1f29b84ab0 fork_trampoline() at fork_trampoline+0xe/frame 0xfe1f29b84ab0 --- trap 0, rip = 0, rsp = 0, rbp = 0 --- -- Julien signature.asc Description: OpenPGP digital signature
Re: EBR fix for life cycle races was Re: panic with tcp timers
On Tue, 28 Jun 2016 23:19:45 -0700 Matthew Macywrote > > > > On Tue, 28 Jun 2016 15:51:57 -0700 K. Macy wrote > > > On Tue, Jun 28, 2016 at 10:51 AM, Matthew Macy wrote: > > > > You guys should really look at Samy Bahra's epoch based reclamation. I > solved a similar problem in drm/linuxkpi using it. > > > > The point being that this is a bug in the TCP life cycle handling > > _not_ in callouts. Churning the callout interface is not the best / > > only solution. > > -M > > Please see see D7017/D7018 as an example for an ultimately more robust > solution than continued fiddling with the callout interface. > > https://reviews.freebsd.org/D7018 I realized that this shortens the race but still leaves one open from the time the callout lock is dropped to the time the epoch begins. I have a proposed fix to make read locks never block and to essentially close the race window. The next issue that comes up is that synchronize is called too often. I'll talk to Samy about it in a few hours and come up with a better design. -M > > > > On Tue, 28 Jun 2016 02:58:56 -0700 Julien Charbon > wrote > > > > > > > > Hi Randall, > > > > > > > > On 6/25/16 4:41 PM, Randall Stewart via freebsd-net wrote: > > > > > Ok > > > > > > > > > > Lets try this again with my source changed to my @freebsd.net :-) > > > > > > > > > > Now I am also attaching a patch for you Gleb, this will take some > poking to > > > > > get in to your NF-head since it incorporates some changes we made > earlier. > > > > > > > > > > I think this will fix the problem.. i.e. dealing with two locks in > the callout system (which it was > > > > > never meant to have done).. > > > > > > > > > > Note we probably can move the code to use the callout lock init > now.. but lets see if this works > > > > > on your setup on c096 and if so we can think about doing that. > > > > > > > > Thanks for proposing a patch. I believe your patch will work with > > > > callout lock init, but not without: You still have a use-after-free > > > > > issue on the tcpcb without callout lock init. > > > > > > > > The case being subtle as usual, let me try to describe that could > happen: > > > > > > > > With your patch we have: > > > > > > > > void > > > > tcp_timer_keep(void *xtp) > > > > { > > > > struct tcpcb *tp = xtp; > > > > struct tcptemp *t_template; > > > > struct inpcb *inp; > > > > CURVNET_SET(tp->t_vnet); > > > > #ifdef TCPDEBUG > > > > int ostate; > > > > > > > > ostate = tp->t_state; > > > > #endif > > > > inp = tp->t_inpcb; > > > > KASSERT(inp != NULL, ("%s: tp %p tp->t_inpcb == NULL", > __func__, > > > > tp)); > > > > INP_WLOCK(inp); > > > > if (callout_pending(>t_timers->tt_keep) ### Use after > free > > > > of tp here > > > > !callout_active(>t_timers->tt_keep)) { > > > > INP_WUNLOCK(inp); > > > > CURVNET_RESTORE(); > > > > return; > > > > } > > > > ... > > > > > > > > The use-after-free scenario: > > > > > > > > [CPU 1] the callout fires, tcp_timer_keep entered > > > > [CPU 1] blocks on INP_WLOCK(inp); > > > > [CPU 2] schedules tcp_timer_keep with callout_reset() > > > > [CPU 2] tcp_discardcb called > > > > [CPU 2] tcp_timer_keep callout successfully canceled > > > > [CPU 2] tcpcb freed > > > > [CPU 1] unblocks, the tcpcb is used > > > > > > > > Then the tcpcb will used just after being freed... Might also > crash or > > > > not depending in the case. > > > > > > > > Extra notes: > > > > > > > > o The invariant I see here is: The "callout successfully canceled" > > > > > step should never happen when "the callout is currently being > executed". > > > > > > > > o Solutions I see to enforce this invariant: > > > > > > > > - First solution: Use callout lock init with inp lock, your patch > > > > seems to permit that now. > > > > > > > > - Second solution: Change callout_async_drain() behavior: It can > > > > return 0 (fail) when the callout is currently being executed (no > matter > > > > what). > > > > > > > > - Third solution: Don't trust callout_async_drain(callout) return > > > > value of 1 (success) if the previous call of callout_reset(callout) > > > > returned 0 (fail). That was the exact purpose of r284261 change, > but > > > > this solution is also step backward in
EBR fix for life cycle races was Re: panic with tcp timers
On Tue, 28 Jun 2016 15:51:57 -0700 K. Macywrote > On Tue, Jun 28, 2016 at 10:51 AM, Matthew Macy wrote: > > You guys should really look at Samy Bahra's epoch based reclamation. I > > solved a similar problem in drm/linuxkpi using it. > > The point being that this is a bug in the TCP life cycle handling > _not_ in callouts. Churning the callout interface is not the best / > only solution. > -M Please see see D7017/D7018 as an example for an ultimately more robust solution than continued fiddling with the callout interface. https://reviews.freebsd.org/D7018 Cheers. -M > > On Tue, 28 Jun 2016 02:58:56 -0700 Julien Charbon > > wrote > > > > > > Hi Randall, > > > > > > On 6/25/16 4:41 PM, Randall Stewart via freebsd-net wrote: > > > > Ok > > > > > > > > Lets try this again with my source changed to my @freebsd.net :-) > > > > > > > > Now I am also attaching a patch for you Gleb, this will take some > > poking to > > > > get in to your NF-head since it incorporates some changes we made > > earlier. > > > > > > > > I think this will fix the problem.. i.e. dealing with two locks in > > the callout system (which it was > > > > never meant to have done).. > > > > > > > > Note we probably can move the code to use the callout lock init now.. > > but lets see if this works > > > > on your setup on c096 and if so we can think about doing that. > > > > > > Thanks for proposing a patch. I believe your patch will work with > > > callout lock init, but not without: You still have a use-after-free > > > issue on the tcpcb without callout lock init. > > > > > > The case being subtle as usual, let me try to describe that could > > happen: > > > > > > With your patch we have: > > > > > > void > > > tcp_timer_keep(void *xtp) > > > { > > > struct tcpcb *tp = xtp; > > > struct tcptemp *t_template; > > > struct inpcb *inp; > > > CURVNET_SET(tp->t_vnet); > > > #ifdef TCPDEBUG > > > int ostate; > > > > > > ostate = tp->t_state; > > > #endif > > > inp = tp->t_inpcb; > > > KASSERT(inp != NULL, ("%s: tp %p tp->t_inpcb == NULL", > > __func__, > > > tp)); > > > INP_WLOCK(inp); > > > if (callout_pending(>t_timers->tt_keep) ### Use after free > > > of tp here > > > !callout_active(>t_timers->tt_keep)) { > > > INP_WUNLOCK(inp); > > > CURVNET_RESTORE(); > > > return; > > > } > > > ... > > > > > > The use-after-free scenario: > > > > > > [CPU 1] the callout fires, tcp_timer_keep entered > > > [CPU 1] blocks on INP_WLOCK(inp); > > > [CPU 2] schedules tcp_timer_keep with callout_reset() > > > [CPU 2] tcp_discardcb called > > > [CPU 2] tcp_timer_keep callout successfully canceled > > > [CPU 2] tcpcb freed > > > [CPU 1] unblocks, the tcpcb is used > > > > > > Then the tcpcb will used just after being freed... Might also crash > > or > > > not depending in the case. > > > > > > Extra notes: > > > > > > o The invariant I see here is: The "callout successfully canceled" > > > step should never happen when "the callout is currently being > > executed". > > > > > > o Solutions I see to enforce this invariant: > > > > > > - First solution: Use callout lock init with inp lock, your patch > > > seems to permit that now. > > > > > > - Second solution: Change callout_async_drain() behavior: It can > > > return 0 (fail) when the callout is currently being executed (no matter > > > what). > > > > > > - Third solution: Don't trust callout_async_drain(callout) return > > > value of 1 (success) if the previous call of callout_reset(callout) > > > returned 0 (fail). That was the exact purpose of r284261 change, but > > > this solution is also step backward in modernization of TCP > > > timers/callout... > > > > > > > > https://svnweb.freebsd.org/base/stable/10/sys/netinet/tcp_timer.c?r1=284261=284260=284261 > > > > > > > > Hopefully my description is clear enough... > > > > > > -- > > > Julien > > > > > > > > > > ___ > > freebsd-current@freebsd.org mailing list > > https://lists.freebsd.org/mailman/listinfo/freebsd-current > > To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org" > ___ > freebsd-current@freebsd.org mailing list > https://lists.freebsd.org/mailman/listinfo/freebsd-current > To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org" > ___ freebsd-current@freebsd.org mailing list
Re: panic with tcp timers
On Tue, Jun 28, 2016 at 10:51 AM, Matthew Macywrote: > You guys should really look at Samy Bahra's epoch based reclamation. I solved > a similar problem in drm/linuxkpi using it. The point being that this is a bug in the TCP life cycle handling _not_ in callouts. Churning the callout interface is not the best / only solution. -M > > On Tue, 28 Jun 2016 02:58:56 -0700 Julien Charbon > wrote > > > > Hi Randall, > > > > On 6/25/16 4:41 PM, Randall Stewart via freebsd-net wrote: > > > Ok > > > > > > Lets try this again with my source changed to my @freebsd.net :-) > > > > > > Now I am also attaching a patch for you Gleb, this will take some poking > to > > > get in to your NF-head since it incorporates some changes we made > earlier. > > > > > > I think this will fix the problem.. i.e. dealing with two locks in the > callout system (which it was > > > never meant to have done).. > > > > > > Note we probably can move the code to use the callout lock init now.. > but lets see if this works > > > on your setup on c096 and if so we can think about doing that. > > > > Thanks for proposing a patch. I believe your patch will work with > > callout lock init, but not without: You still have a use-after-free > > issue on the tcpcb without callout lock init. > > > > The case being subtle as usual, let me try to describe that could happen: > > > > With your patch we have: > > > > void > > tcp_timer_keep(void *xtp) > > { > > struct tcpcb *tp = xtp; > > struct tcptemp *t_template; > > struct inpcb *inp; > > CURVNET_SET(tp->t_vnet); > > #ifdef TCPDEBUG > > int ostate; > > > > ostate = tp->t_state; > > #endif > > inp = tp->t_inpcb; > > KASSERT(inp != NULL, ("%s: tp %p tp->t_inpcb == NULL", __func__, > > tp)); > > INP_WLOCK(inp); > > if (callout_pending(>t_timers->tt_keep) ### Use after free > > of tp here > > !callout_active(>t_timers->tt_keep)) { > > INP_WUNLOCK(inp); > > CURVNET_RESTORE(); > > return; > > } > > ... > > > > The use-after-free scenario: > > > > [CPU 1] the callout fires, tcp_timer_keep entered > > [CPU 1] blocks on INP_WLOCK(inp); > > [CPU 2] schedules tcp_timer_keep with callout_reset() > > [CPU 2] tcp_discardcb called > > [CPU 2] tcp_timer_keep callout successfully canceled > > [CPU 2] tcpcb freed > > [CPU 1] unblocks, the tcpcb is used > > > > Then the tcpcb will used just after being freed... Might also crash or > > not depending in the case. > > > > Extra notes: > > > > o The invariant I see here is: The "callout successfully canceled" > > step should never happen when "the callout is currently being executed". > > > > o Solutions I see to enforce this invariant: > > > > - First solution: Use callout lock init with inp lock, your patch > > seems to permit that now. > > > > - Second solution: Change callout_async_drain() behavior: It can > > return 0 (fail) when the callout is currently being executed (no matter > > what). > > > > - Third solution: Don't trust callout_async_drain(callout) return > > value of 1 (success) if the previous call of callout_reset(callout) > > returned 0 (fail). That was the exact purpose of r284261 change, but > > this solution is also step backward in modernization of TCP > > timers/callout... > > > > > https://svnweb.freebsd.org/base/stable/10/sys/netinet/tcp_timer.c?r1=284261=284260=284261 > > > > Hopefully my description is clear enough... > > > > -- > > Julien > > > > > > ___ > freebsd-current@freebsd.org mailing list > https://lists.freebsd.org/mailman/listinfo/freebsd-current > To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org" ___ freebsd-current@freebsd.org mailing list https://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"
Re: panic with tcp timers
You guys should really look at Samy Bahra's epoch based reclamation. I solved a similar problem in drm/linuxkpi using it. -M On Tue, 28 Jun 2016 02:58:56 -0700 Julien Charbonwrote > > Hi Randall, > > On 6/25/16 4:41 PM, Randall Stewart via freebsd-net wrote: > > Ok > > > > Lets try this again with my source changed to my @freebsd.net :-) > > > > Now I am also attaching a patch for you Gleb, this will take some poking > > to > > get in to your NF-head since it incorporates some changes we made earlier. > > > > I think this will fix the problem.. i.e. dealing with two locks in the > > callout system (which it was > > never meant to have done).. > > > > Note we probably can move the code to use the callout lock init now.. but > > lets see if this works > > on your setup on c096 and if so we can think about doing that. > > Thanks for proposing a patch. I believe your patch will work with > callout lock init, but not without: You still have a use-after-free > issue on the tcpcb without callout lock init. > > The case being subtle as usual, let me try to describe that could happen: > > With your patch we have: > > void > tcp_timer_keep(void *xtp) > { > struct tcpcb *tp = xtp; > struct tcptemp *t_template; > struct inpcb *inp; > CURVNET_SET(tp->t_vnet); > #ifdef TCPDEBUG > int ostate; > > ostate = tp->t_state; > #endif > inp = tp->t_inpcb; > KASSERT(inp != NULL, ("%s: tp %p tp->t_inpcb == NULL", __func__, > tp)); > INP_WLOCK(inp); > if (callout_pending(>t_timers->tt_keep) ### Use after free > of tp here > !callout_active(>t_timers->tt_keep)) { > INP_WUNLOCK(inp); > CURVNET_RESTORE(); > return; > } > ... > > The use-after-free scenario: > > [CPU 1] the callout fires, tcp_timer_keep entered > [CPU 1] blocks on INP_WLOCK(inp); > [CPU 2] schedules tcp_timer_keep with callout_reset() > [CPU 2] tcp_discardcb called > [CPU 2] tcp_timer_keep callout successfully canceled > [CPU 2] tcpcb freed > [CPU 1] unblocks, the tcpcb is used > > Then the tcpcb will used just after being freed... Might also crash or > not depending in the case. > > Extra notes: > > o The invariant I see here is: The "callout successfully canceled" > step should never happen when "the callout is currently being executed". > > o Solutions I see to enforce this invariant: > > - First solution: Use callout lock init with inp lock, your patch > seems to permit that now. > > - Second solution: Change callout_async_drain() behavior: It can > return 0 (fail) when the callout is currently being executed (no matter > what). > > - Third solution: Don't trust callout_async_drain(callout) return > value of 1 (success) if the previous call of callout_reset(callout) > returned 0 (fail). That was the exact purpose of r284261 change, but > this solution is also step backward in modernization of TCP > timers/callout... > > https://svnweb.freebsd.org/base/stable/10/sys/netinet/tcp_timer.c?r1=284261=284260=284261 > > > Hopefully my description is clear enough... > > -- > Julien > > ___ freebsd-current@freebsd.org mailing list https://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"
Re: panic with tcp timers
Hi Randall, On 6/25/16 4:41 PM, Randall Stewart via freebsd-net wrote: > Ok > > Lets try this again with my source changed to my @freebsd.net :-) > > Now I am also attaching a patch for you Gleb, this will take some poking to > get in to your NF-head since it incorporates some changes we made earlier. > > I think this will fix the problem.. i.e. dealing with two locks in the > callout system (which it was > never meant to have done).. > > Note we probably can move the code to use the callout lock init now.. but > lets see if this works > on your setup on c096 and if so we can think about doing that. Thanks for proposing a patch. I believe your patch will work with callout lock init, but not without: You still have a use-after-free issue on the tcpcb without callout lock init. The case being subtle as usual, let me try to describe that could happen: With your patch we have: void tcp_timer_keep(void *xtp) { struct tcpcb *tp = xtp; struct tcptemp *t_template; struct inpcb *inp; CURVNET_SET(tp->t_vnet); #ifdef TCPDEBUG int ostate; ostate = tp->t_state; #endif inp = tp->t_inpcb; KASSERT(inp != NULL, ("%s: tp %p tp->t_inpcb == NULL", __func__, tp)); INP_WLOCK(inp); if (callout_pending(>t_timers->tt_keep) ### Use after free of tp here !callout_active(>t_timers->tt_keep)) { INP_WUNLOCK(inp); CURVNET_RESTORE(); return; } ... The use-after-free scenario: [CPU 1] the callout fires, tcp_timer_keep entered [CPU 1] blocks on INP_WLOCK(inp); [CPU 2] schedules tcp_timer_keep with callout_reset() [CPU 2] tcp_discardcb called [CPU 2] tcp_timer_keep callout successfully canceled [CPU 2] tcpcb freed [CPU 1] unblocks, the tcpcb is used Then the tcpcb will used just after being freed... Might also crash or not depending in the case. Extra notes: o The invariant I see here is: The "callout successfully canceled" step should never happen when "the callout is currently being executed". o Solutions I see to enforce this invariant: - First solution: Use callout lock init with inp lock, your patch seems to permit that now. - Second solution: Change callout_async_drain() behavior: It can return 0 (fail) when the callout is currently being executed (no matter what). - Third solution: Don't trust callout_async_drain(callout) return value of 1 (success) if the previous call of callout_reset(callout) returned 0 (fail). That was the exact purpose of r284261 change, but this solution is also step backward in modernization of TCP timers/callout... https://svnweb.freebsd.org/base/stable/10/sys/netinet/tcp_timer.c?r1=284261=284260=284261 Hopefully my description is clear enough... -- Julien signature.asc Description: OpenPGP digital signature
Re: panic with tcp timers
Ok Lets try this again with my source changed to my @freebsd.net :-) Now I am also attaching a patch for you Gleb, this will take some poking to get in to your NF-head since it incorporates some changes we made earlier. I think this will fix the problem.. i.e. dealing with two locks in the callout system (which it was never meant to have done).. Note we probably can move the code to use the callout lock init now.. but lets see if this works on your setup on c096 and if so we can think about doing that. R for_gleb Description: Binary data > On Jun 25, 2016, at 4:48 AM, Randall Stewartwrote: > > So > > All of our timers in TCP do something like > - > INFO-LOCK > INP_WLOCK > > if (inp needs to be dropped) { >drop-it > } > do other work > > UNLOCK-INP > UNLOCK-INFO > -- > > And generally the path “inp needs to be dropped” is rarely taken. > > So why don’t we change the procedure to something like: > > INP_WLOCK > if (inp needs to be dropped) { > inp_ref() > INP_WUNLOCK() > INFO_LOCK() > INP_WLOCK() > if (inp_release_ref) { > /* victory its dead */ > INFO_UNLOCK > return > } > do-the-drop > } > > This way we would only go grab the INFO lock in those rarer cases > when we *do* actually want to kill the tcb and at the same time > it would make the current callout system work correctly.. which as > many have pointed out would be much better if we could just let the > lock be gotten by the callout system. Hmm maybe with this scheme we > could even do that... > > R > > >> On Jun 20, 2016, at 1:45 PM, Julien Charbon wrote: >> >> >> Hi, >> >> On 6/20/16 11:58 AM, Gleb Smirnoff wrote: >>> On Mon, Jun 20, 2016 at 11:55:55AM +0200, Julien Charbon wrote: >>> J> > On Fri, Jun 17, 2016 at 11:27:39AM +0200, Julien Charbon wrote: >>> J> > J> > Comparing stable/10 and head, I see two changes that could >>> J> > J> > affect that: >>> J> > J> > >>> J> > J> > - callout_async_drain >>> J> > J> > - switch to READ lock for inp info in tcp timers >>> J> > J> > >>> J> > J> > That's why you are in To, Julien and Hans :) >>> J> > J> > >>> J> > J> > We continue investigating, and I will keep you updated. >>> J> > J> > However, any help is welcome. I can share cores. >>> J> > >>> J> > Now, spending some time with cores and adding a bunch of >>> J> > extra CTRs, I have a sequence of events that lead to the >>> J> > panic. In short, the bug is in the callout system. It seems >>> J> > to be not relevant to the callout_async_drain, at least for >>> J> > now. The transition to READ lock unmasked the problem, that's >>> J> > why NetflixBSD 10 doesn't panic. >>> J> > >>> J> > The panic requires heavy contention on the TCP info lock. >>> J> > >>> J> > [CPU 1] the callout fires, tcp_timer_keep entered >>> J> > [CPU 1] blocks on INP_INFO_RLOCK(_tcbinfo); >>> J> > [CPU 2] schedules the callout >>> J> > [CPU 2] tcp_discardcb called >>> J> > [CPU 2] callout successfully canceled >>> J> > [CPU 2] tcpcb freed >>> J> > [CPU 1] unblocks... panic >>> J> > >>> J> > When the lock was WLOCK, all contenders were resumed in a >>> J> > sequence they came to the lock. Now, that they are readers, >>> J> > once the lock is released, readers are resumed in a "random" >>> J> > order, and this allows tcp_discardcb to go before the old >>> J> > running callout, and this unmasks the panic. >>> J> >>> J> Highly interesting. I should be able to reproduce that (will be useful >>> J> for testing the corresponding fix). >>> J> >>> J> Fix proposal: If callout_async_drain() returns 0 (fail) (instead of 1 >>> J> (success) here) when the callout cancellation is a success _but_ the >>> J> callout is current running, that should fix it. >>> J> >>> J> For the history: It comes back to my old callout question: >>> J> >>> J> Does _callout_stop_safe() is allowed to return 1 (success) even if the >>> J> callout is still currently running; a.k.a. it is not because you >>> J> successfully cancelled a callout that the callout is not currently >>> running. >>> J> >>> J> We did propose a patch to make _callout_stop_safe() returns 0 (fail) >>> J> when the callout is currently running: >>> J> >>> J> callout_stop() should return 0 when the callout is currently being >>> J> serviced and indeed unstoppable >>> J> >>> https://reviews.freebsd.org/differential/changeset/?ref=62513=ignore-most >>> J> >>> J> But this change impacted too many old code paths and was interesting >>> J> only for TCP timers and thus was abandoned. >>> >>> The fix I am working on now is doing exactly that. callout_reset must >>> return 0 if the callout is currently running. >>> >>> What are the old paths impacted? >> >> Actually all the paths that check the callout_stop() return value AND >> call both callout_reset() and callout_stop() AND use mpsafe callout(). >> And for each path, we would have to check our patch was ok (or not).
Re: panic with tcp timers
So All of our timers in TCP do something like - INFO-LOCK INP_WLOCK if (inp needs to be dropped) { drop-it } do other work UNLOCK-INP UNLOCK-INFO -- And generally the path “inp needs to be dropped” is rarely taken. So why don’t we change the procedure to something like: INP_WLOCK if (inp needs to be dropped) { inp_ref() INP_WUNLOCK() INFO_LOCK() INP_WLOCK() if (inp_release_ref) { /* victory its dead */ INFO_UNLOCK return } do-the-drop } This way we would only go grab the INFO lock in those rarer cases when we *do* actually want to kill the tcb and at the same time it would make the current callout system work correctly.. which as many have pointed out would be much better if we could just let the lock be gotten by the callout system. Hmm maybe with this scheme we could even do that... R > On Jun 20, 2016, at 1:45 PM, Julien Charbonwrote: > > > Hi, > > On 6/20/16 11:58 AM, Gleb Smirnoff wrote: >> On Mon, Jun 20, 2016 at 11:55:55AM +0200, Julien Charbon wrote: >> J> > On Fri, Jun 17, 2016 at 11:27:39AM +0200, Julien Charbon wrote: >> J> > J> > Comparing stable/10 and head, I see two changes that could >> J> > J> > affect that: >> J> > J> > >> J> > J> > - callout_async_drain >> J> > J> > - switch to READ lock for inp info in tcp timers >> J> > J> > >> J> > J> > That's why you are in To, Julien and Hans :) >> J> > J> > >> J> > J> > We continue investigating, and I will keep you updated. >> J> > J> > However, any help is welcome. I can share cores. >> J> > >> J> > Now, spending some time with cores and adding a bunch of >> J> > extra CTRs, I have a sequence of events that lead to the >> J> > panic. In short, the bug is in the callout system. It seems >> J> > to be not relevant to the callout_async_drain, at least for >> J> > now. The transition to READ lock unmasked the problem, that's >> J> > why NetflixBSD 10 doesn't panic. >> J> > >> J> > The panic requires heavy contention on the TCP info lock. >> J> > >> J> > [CPU 1] the callout fires, tcp_timer_keep entered >> J> > [CPU 1] blocks on INP_INFO_RLOCK(_tcbinfo); >> J> > [CPU 2] schedules the callout >> J> > [CPU 2] tcp_discardcb called >> J> > [CPU 2] callout successfully canceled >> J> > [CPU 2] tcpcb freed >> J> > [CPU 1] unblocks... panic >> J> > >> J> > When the lock was WLOCK, all contenders were resumed in a >> J> > sequence they came to the lock. Now, that they are readers, >> J> > once the lock is released, readers are resumed in a "random" >> J> > order, and this allows tcp_discardcb to go before the old >> J> > running callout, and this unmasks the panic. >> J> >> J> Highly interesting. I should be able to reproduce that (will be useful >> J> for testing the corresponding fix). >> J> >> J> Fix proposal: If callout_async_drain() returns 0 (fail) (instead of 1 >> J> (success) here) when the callout cancellation is a success _but_ the >> J> callout is current running, that should fix it. >> J> >> J> For the history: It comes back to my old callout question: >> J> >> J> Does _callout_stop_safe() is allowed to return 1 (success) even if the >> J> callout is still currently running; a.k.a. it is not because you >> J> successfully cancelled a callout that the callout is not currently >> running. >> J> >> J> We did propose a patch to make _callout_stop_safe() returns 0 (fail) >> J> when the callout is currently running: >> J> >> J> callout_stop() should return 0 when the callout is currently being >> J> serviced and indeed unstoppable >> J> >> https://reviews.freebsd.org/differential/changeset/?ref=62513=ignore-most >> J> >> J> But this change impacted too many old code paths and was interesting >> J> only for TCP timers and thus was abandoned. >> >> The fix I am working on now is doing exactly that. callout_reset must >> return 0 if the callout is currently running. >> >> What are the old paths impacted? > > Actually all the paths that check the callout_stop() return value AND > call both callout_reset() and callout_stop() AND use mpsafe callout(). > And for each path, we would have to check our patch was ok (or not). > > Thus, if you only do the change in callout_async_drain() context, you > don't impact the "old" callout_stop() behavior and get the desired > behavior for the TCP timers. Might be a good trade-off... > > My 2 cents. > > -- > Julien Randall Stewart r...@netflix.com 803-317-4952 ___ freebsd-current@freebsd.org mailing list https://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"
Re: panic with tcp timers
Hi, On 6/20/16 11:58 AM, Gleb Smirnoff wrote: > On Mon, Jun 20, 2016 at 11:55:55AM +0200, Julien Charbon wrote: > J> > On Fri, Jun 17, 2016 at 11:27:39AM +0200, Julien Charbon wrote: > J> > J> > Comparing stable/10 and head, I see two changes that could > J> > J> > affect that: > J> > J> > > J> > J> > - callout_async_drain > J> > J> > - switch to READ lock for inp info in tcp timers > J> > J> > > J> > J> > That's why you are in To, Julien and Hans :) > J> > J> > > J> > J> > We continue investigating, and I will keep you updated. > J> > J> > However, any help is welcome. I can share cores. > J> > > J> > Now, spending some time with cores and adding a bunch of > J> > extra CTRs, I have a sequence of events that lead to the > J> > panic. In short, the bug is in the callout system. It seems > J> > to be not relevant to the callout_async_drain, at least for > J> > now. The transition to READ lock unmasked the problem, that's > J> > why NetflixBSD 10 doesn't panic. > J> > > J> > The panic requires heavy contention on the TCP info lock. > J> > > J> > [CPU 1] the callout fires, tcp_timer_keep entered > J> > [CPU 1] blocks on INP_INFO_RLOCK(_tcbinfo); > J> > [CPU 2] schedules the callout > J> > [CPU 2] tcp_discardcb called > J> > [CPU 2] callout successfully canceled > J> > [CPU 2] tcpcb freed > J> > [CPU 1] unblocks... panic > J> > > J> > When the lock was WLOCK, all contenders were resumed in a > J> > sequence they came to the lock. Now, that they are readers, > J> > once the lock is released, readers are resumed in a "random" > J> > order, and this allows tcp_discardcb to go before the old > J> > running callout, and this unmasks the panic. > J> > J> Highly interesting. I should be able to reproduce that (will be useful > J> for testing the corresponding fix). > J> > J> Fix proposal: If callout_async_drain() returns 0 (fail) (instead of 1 > J> (success) here) when the callout cancellation is a success _but_ the > J> callout is current running, that should fix it. > J> > J> For the history: It comes back to my old callout question: > J> > J> Does _callout_stop_safe() is allowed to return 1 (success) even if the > J> callout is still currently running; a.k.a. it is not because you > J> successfully cancelled a callout that the callout is not currently running. > J> > J> We did propose a patch to make _callout_stop_safe() returns 0 (fail) > J> when the callout is currently running: > J> > J> callout_stop() should return 0 when the callout is currently being > J> serviced and indeed unstoppable > J> > https://reviews.freebsd.org/differential/changeset/?ref=62513=ignore-most > J> > J> But this change impacted too many old code paths and was interesting > J> only for TCP timers and thus was abandoned. > > The fix I am working on now is doing exactly that. callout_reset must > return 0 if the callout is currently running. > > What are the old paths impacted? Actually all the paths that check the callout_stop() return value AND call both callout_reset() and callout_stop() AND use mpsafe callout(). And for each path, we would have to check our patch was ok (or not). Thus, if you only do the change in callout_async_drain() context, you don't impact the "old" callout_stop() behavior and get the desired behavior for the TCP timers. Might be a good trade-off... My 2 cents. -- Julien ___ freebsd-current@freebsd.org mailing list https://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"
Re: panic with tcp timers
Hi, On 6/20/16 12:30 PM, Gleb Smirnoff wrote: > On Mon, Jun 20, 2016 at 12:14:18PM +0200, Hans Petter Selasky wrote: > H> On 06/20/16 11:58, Gleb Smirnoff wrote: > H> > The fix I am working on now is doing exactly that. callout_reset must > H> > return 0 if the callout is currently running. > H> > > H> > What are the old paths impacted? > H> > H> I'll dig into the matter aswell and give some comments. Thanks for the > H> analysis, Gleb. > H> > H> FYI: This class of problems wouldn't exist if the callout could be > H> associated with a mutex! > > Exactly! I am convinced that all callouts should be locked, and non-locked > one should simply go away, as well as async drain. > > What does prevent us from converting TCP timeouts to locked? To my > understanding it is the lock order of taking pcbinfo after pcb lock. > > I'm now trying to understand Julien's conversion from pcbinfo lock > to pcbinfo + pcblist locks. It seems to me that we actually can convert > TCP timers to locked callouts. > > What for do we need pcbinfo read lock in a tcp timer? The timer works > only on the pcb and doesn't modify global lists, does it? tcp_timer_keep() for example can modify global pcb list, see the call stack below: tcp_timer_keep() tcp_drop() tcp_close() sofree() tcp_usr_detach() (via pr->pr_usrreqs->pru_detach() in sofree()) tcp_detach() in_pcbfree() in_pcbremlists() Anyway, a bit of history here: o In stable/10 the TCP locking order is: ipi_lock (before) inpcb lock and in stable/10 ipi_lock is protecting the global pcb list. Then, only in the cases where you were absolutely sure you are _not_ going to modify the global pcb list you are allowed to take the inpcb lock only. For all the other cases, you have to take the write lock on ipi_lock first due to lock order. And in context of short-lived TCP connection of the 5 received packets for a TCP connection, 4 require the write ipi_lock lock, and that's does not scale well. Lesson learned: For scaling perspective, in lock ordering always put the most global lock last. It was improved in a lean way in 11: o In 11 the TCP lock order became: ipi_lock (before) inpcb lock (before) ipi_list And in 11 ipi_list protects global pcb list, and only the code actually modifying the global list is protected by a global write lock, e.g.: https://github.com/freebsd/freebsd/blob/master/sys/netinet/in_pcb.c#L1285 Then why keeping the ipi_lock lock at all now? It is solely for one case: When you need the complete stability of the full pcb list while traversing it. These full pcb list traversals are done in functions like: in_pcbnotifyall(), in_pcbpurgeif0(), inp_apply_all(), tcp_ccalgounload(), tcp_drain(), etc. Thus in 11 ipi_lock write lock is required only when you do this full traversal with list stability requirement, and the ipi_lock read lock in all other cases like tcp_input() that then scales better. Sadly in 11, you cannot use the inpcb lock as is for the TCP timers callout, because like tcp_timer_keep() most TCP timers callout can modify the global pcb list and then you need the read lock ipi_lock in top of inpcb lock... o Future/12: The next natural step is to remove the ipi_lock from the picture to get: inpcb lock (before) ipi_list It /just/ requires a global pcb list that allows addition/deletion while doing a full traversal concurrently. A classical way to achieve that, is to use a RCU-based list. As soon as RCU (or anything equivalent like lwref) are supported in kernel, we will implement this change. Just history here, as I already did a presentation on this exact topic at BSDCan 2015: https://wiki.freebsd.org/201506DevSummit#line-83 It was recorded but I never saw the footage/presentation actually published. :) -- Julien signature.asc Description: OpenPGP digital signature
Re: panic with tcp timers
There's implications for RSS with how the callout system currently works. If you don't know the RSS bucket ID of a connection in advance, you'll create callouts on the wrong CPUs and then they're not migrated. The initial work here did convert things over, but didn't place the callouts in the right CPU/RSS bucket and there wasn't a mechanism to fix this. :( (But I'm also a firm believer of that too. I'd also finally just like the callout system to not be tied to per-CPU queues, but also per-RSS-bucket callout wheels so we could assign a CPU mask to a given callout wheel and then migrating things around is just "change cpu mask", not "change callout cpu id.") -adrian On 20 June 2016 at 04:00, Hans Petter Selaskywrote: > On 06/20/16 12:30, Gleb Smirnoff wrote: >> >> What does prevent us from converting TCP timeouts to locked? To my >> understanding it is the lock order of taking pcbinfo after pcb lock. > > > I started this work: > > https://reviews.freebsd.org/D1563 > > --HPS > ___ > freebsd-...@freebsd.org mailing list > https://lists.freebsd.org/mailman/listinfo/freebsd-net > To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org" ___ freebsd-current@freebsd.org mailing list https://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"
Re: panic with tcp timers
On 06/20/16 12:30, Gleb Smirnoff wrote: What does prevent us from converting TCP timeouts to locked? To my understanding it is the lock order of taking pcbinfo after pcb lock. I started this work: https://reviews.freebsd.org/D1563 --HPS ___ freebsd-current@freebsd.org mailing list https://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"
Re: panic with tcp timers
On 06/20/16 12:30, Gleb Smirnoff wrote: Exactly! I am convinced that all callouts should be locked, and non-locked one should simply go away, as well as async drain. I agree about that that, except you still need the async drain, because it will prevent freeing the lock protecting the callout, which may still be in use after callout_stop(). --HPS ___ freebsd-current@freebsd.org mailing list https://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"
Re: panic with tcp timers
On Mon, Jun 20, 2016 at 11:55:55AM +0200, Julien Charbon wrote: > > Hi, > > On 6/20/16 9:39 AM, Gleb Smirnoff wrote: > > On Fri, Jun 17, 2016 at 11:27:39AM +0200, Julien Charbon wrote: > > J> > Comparing stable/10 and head, I see two changes that could > > J> > affect that: > > J> > > > J> > - callout_async_drain > > J> > - switch to READ lock for inp info in tcp timers > > J> > > > J> > That's why you are in To, Julien and Hans :) > > J> > > > J> > We continue investigating, and I will keep you updated. > > J> > However, any help is welcome. I can share cores. > > > > Now, spending some time with cores and adding a bunch of > > extra CTRs, I have a sequence of events that lead to the > > panic. In short, the bug is in the callout system. It seems > > to be not relevant to the callout_async_drain, at least for > > now. The transition to READ lock unmasked the problem, that's > > why NetflixBSD 10 doesn't panic. > > > > The panic requires heavy contention on the TCP info lock. > > > > [CPU 1] the callout fires, tcp_timer_keep entered > > [CPU 1] blocks on INP_INFO_RLOCK(_tcbinfo); > > [CPU 2] schedules the callout > > [CPU 2] tcp_discardcb called > > [CPU 2] callout successfully canceled > > [CPU 2] tcpcb freed > > [CPU 1] unblocks... panic > > > > When the lock was WLOCK, all contenders were resumed in a > > sequence they came to the lock. Now, that they are readers, > > once the lock is released, readers are resumed in a "random" > > order, and this allows tcp_discardcb to go before the old > > running callout, and this unmasks the panic. > > Highly interesting. I should be able to reproduce that (will be useful > for testing the corresponding fix). > > Fix proposal: If callout_async_drain() returns 0 (fail) (instead of 1 > (success) here) when the callout cancellation is a success _but_ the > callout is current running, that should fix it. > > For the history: It comes back to my old callout question: > > Does _callout_stop_safe() is allowed to return 1 (success) even if the > callout is still currently running; a.k.a. it is not because you > successfully cancelled a callout that the callout is not currently running. > > We did propose a patch to make _callout_stop_safe() returns 0 (fail) > when the callout is currently running: > > callout_stop() should return 0 when the callout is currently being > serviced and indeed unstoppable > https://reviews.freebsd.org/differential/changeset/?ref=62513=ignore-most > > But this change impacted too many old code paths and was interesting > only for TCP timers and thus was abandoned. Look at callout_stop CS_MIGRBLOCK flag and the fix in sleepq_check_timeout(). Or, at least, do not allow this use of callout_stop() to rot again, after previous dozen regressions and fixes there. ___ freebsd-current@freebsd.org mailing list https://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"
Re: panic with tcp timers
On 06/20/16 11:58, Gleb Smirnoff wrote: J> callout_stop() should return 0 when the callout is currently being J> serviced and indeed unstoppable J> https://reviews.freebsd.org/differential/changeset/?ref=62513=ignore-most What are the old paths impacted? Hi Gleb, Digging through my e-mail archive rephrasing myself and comments about: https://reviews.freebsd.org/D3078 There are two kinds of callouts. So-called MPSAFE callouts which doesn't have a mutex associated with it, and non-MPSAFE which do. When you are associating a mutex with a callout, callout_stop() will _always_ cancel the callback even if it is pending, and this should be reflected by the return value. Your proposed changes will break that ??? The changes in D3078 didn't take into account "use_lock" at least, and so the return values for the non-MPSAFE case becomes incorrect. Hi, I think this patch is incomplete and can break the return value for non-MPSAFE callouts. I think the correct statement should check the value of "use_lock" too: not_running = !(cc_exec_curr(cc, direct) == c && use_lock == 0); Because if the callback process waits for lock "c_lock" in the callback process then we have above "cc_exec_curr(cc, direct) == c" is satisfied too, and non-MPSAFE callouts are always cancelable, via cc_exec_cancel(cc, direct) = true; --HPS ___ freebsd-current@freebsd.org mailing list https://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"
Re: panic with tcp timers
On Mon, Jun 20, 2016 at 12:14:18PM +0200, Hans Petter Selasky wrote: H> On 06/20/16 11:58, Gleb Smirnoff wrote: H> > The fix I am working on now is doing exactly that. callout_reset must H> > return 0 if the callout is currently running. H> > H> > What are the old paths impacted? H> H> Hi, H> H> I'll dig into the matter aswell and give some comments. Thanks for the H> analysis, Gleb. H> H> FYI: This class of problems wouldn't exist if the callout could be H> associated with a mutex! Exactly! I am convinced that all callouts should be locked, and non-locked one should simply go away, as well as async drain. What does prevent us from converting TCP timeouts to locked? To my understanding it is the lock order of taking pcbinfo after pcb lock. I'm now trying to understand Julien's conversion from pcbinfo lock to pcbinfo + pcblist locks. It seems to me that we actually can convert TCP timers to locked callouts. What for do we need pcbinfo read lock in a tcp timer? The timer works only on the pcb and doesn't modify global lists, does it? -- Totus tuus, Glebius. ___ freebsd-current@freebsd.org mailing list https://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"
Re: panic with tcp timers
On 06/20/16 11:58, Gleb Smirnoff wrote: The fix I am working on now is doing exactly that. callout_reset must return 0 if the callout is currently running. What are the old paths impacted? Hi, I'll dig into the matter aswell and give some comments. Thanks for the analysis, Gleb. FYI: This class of problems wouldn't exist if the callout could be associated with a mutex! --HPS ___ freebsd-current@freebsd.org mailing list https://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"
Re: panic with tcp timers
On Mon, Jun 20, 2016 at 11:55:55AM +0200, Julien Charbon wrote: J> > On Fri, Jun 17, 2016 at 11:27:39AM +0200, Julien Charbon wrote: J> > J> > Comparing stable/10 and head, I see two changes that could J> > J> > affect that: J> > J> > J> > J> > - callout_async_drain J> > J> > - switch to READ lock for inp info in tcp timers J> > J> > J> > J> > That's why you are in To, Julien and Hans :) J> > J> > J> > J> > We continue investigating, and I will keep you updated. J> > J> > However, any help is welcome. I can share cores. J> > J> > Now, spending some time with cores and adding a bunch of J> > extra CTRs, I have a sequence of events that lead to the J> > panic. In short, the bug is in the callout system. It seems J> > to be not relevant to the callout_async_drain, at least for J> > now. The transition to READ lock unmasked the problem, that's J> > why NetflixBSD 10 doesn't panic. J> > J> > The panic requires heavy contention on the TCP info lock. J> > J> > [CPU 1] the callout fires, tcp_timer_keep entered J> > [CPU 1] blocks on INP_INFO_RLOCK(_tcbinfo); J> > [CPU 2] schedules the callout J> > [CPU 2] tcp_discardcb called J> > [CPU 2] callout successfully canceled J> > [CPU 2] tcpcb freed J> > [CPU 1] unblocks... panic J> > J> > When the lock was WLOCK, all contenders were resumed in a J> > sequence they came to the lock. Now, that they are readers, J> > once the lock is released, readers are resumed in a "random" J> > order, and this allows tcp_discardcb to go before the old J> > running callout, and this unmasks the panic. J> J> Highly interesting. I should be able to reproduce that (will be useful J> for testing the corresponding fix). J> J> Fix proposal: If callout_async_drain() returns 0 (fail) (instead of 1 J> (success) here) when the callout cancellation is a success _but_ the J> callout is current running, that should fix it. J> J> For the history: It comes back to my old callout question: J> J> Does _callout_stop_safe() is allowed to return 1 (success) even if the J> callout is still currently running; a.k.a. it is not because you J> successfully cancelled a callout that the callout is not currently running. J> J> We did propose a patch to make _callout_stop_safe() returns 0 (fail) J> when the callout is currently running: J> J> callout_stop() should return 0 when the callout is currently being J> serviced and indeed unstoppable J> https://reviews.freebsd.org/differential/changeset/?ref=62513=ignore-most J> J> But this change impacted too many old code paths and was interesting J> only for TCP timers and thus was abandoned. The fix I am working on now is doing exactly that. callout_reset must return 0 if the callout is currently running. What are the old paths impacted? -- Totus tuus, Glebius. ___ freebsd-current@freebsd.org mailing list https://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"
Re: panic with tcp timers
Hi, On 6/20/16 9:39 AM, Gleb Smirnoff wrote: > On Fri, Jun 17, 2016 at 11:27:39AM +0200, Julien Charbon wrote: > J> > Comparing stable/10 and head, I see two changes that could > J> > affect that: > J> > > J> > - callout_async_drain > J> > - switch to READ lock for inp info in tcp timers > J> > > J> > That's why you are in To, Julien and Hans :) > J> > > J> > We continue investigating, and I will keep you updated. > J> > However, any help is welcome. I can share cores. > > Now, spending some time with cores and adding a bunch of > extra CTRs, I have a sequence of events that lead to the > panic. In short, the bug is in the callout system. It seems > to be not relevant to the callout_async_drain, at least for > now. The transition to READ lock unmasked the problem, that's > why NetflixBSD 10 doesn't panic. > > The panic requires heavy contention on the TCP info lock. > > [CPU 1] the callout fires, tcp_timer_keep entered > [CPU 1] blocks on INP_INFO_RLOCK(_tcbinfo); > [CPU 2] schedules the callout > [CPU 2] tcp_discardcb called > [CPU 2] callout successfully canceled > [CPU 2] tcpcb freed > [CPU 1] unblocks... panic > > When the lock was WLOCK, all contenders were resumed in a > sequence they came to the lock. Now, that they are readers, > once the lock is released, readers are resumed in a "random" > order, and this allows tcp_discardcb to go before the old > running callout, and this unmasks the panic. Highly interesting. I should be able to reproduce that (will be useful for testing the corresponding fix). Fix proposal: If callout_async_drain() returns 0 (fail) (instead of 1 (success) here) when the callout cancellation is a success _but_ the callout is current running, that should fix it. For the history: It comes back to my old callout question: Does _callout_stop_safe() is allowed to return 1 (success) even if the callout is still currently running; a.k.a. it is not because you successfully cancelled a callout that the callout is not currently running. We did propose a patch to make _callout_stop_safe() returns 0 (fail) when the callout is currently running: callout_stop() should return 0 when the callout is currently being serviced and indeed unstoppable https://reviews.freebsd.org/differential/changeset/?ref=62513=ignore-most But this change impacted too many old code paths and was interesting only for TCP timers and thus was abandoned. My 2 cents. -- Julien signature.asc Description: OpenPGP digital signature
Re: panic with tcp timers
Hi! On Fri, Jun 17, 2016 at 11:27:39AM +0200, Julien Charbon wrote: J> > Comparing stable/10 and head, I see two changes that could J> > affect that: J> > J> > - callout_async_drain J> > - switch to READ lock for inp info in tcp timers J> > J> > That's why you are in To, Julien and Hans :) J> > J> > We continue investigating, and I will keep you updated. J> > However, any help is welcome. I can share cores. Now, spending some time with cores and adding a bunch of extra CTRs, I have a sequence of events that lead to the panic. In short, the bug is in the callout system. It seems to be not relevant to the callout_async_drain, at least for now. The transition to READ lock unmasked the problem, that's why NetflixBSD 10 doesn't panic. The panic requires heavy contention on the TCP info lock. [CPU 1] the callout fires, tcp_timer_keep entered [CPU 1] blocks on INP_INFO_RLOCK(_tcbinfo); [CPU 2] schedules the callout [CPU 2] tcp_discardcb called [CPU 2] callout successfully canceled [CPU 2] tcpcb freed [CPU 1] unblocks... panic When the lock was WLOCK, all contenders were resumed in a sequence they came to the lock. Now, that they are readers, once the lock is released, readers are resumed in a "random" order, and this allows tcp_discardcb to go before the old running callout, and this unmasks the panic. Exact quote from ktrdump: 0xf82089867418 <-- faulty tcpcb 0xf82089867728 <-- its tt_keep glebius@piston:~/cores:|>grep 0xf82089867418 ktr 19012192 11 scheduled 0xf820898677a8 func 0x80628740 arg 0xf82089867418 in 20553.1e3ff819 19012190 11 rescheduled 0xf82089867728 func 0x80628bb0 arg 0xf82089867418 in 20613.04a6193d 19009551 11 rescheduled 0xf82089867728 func 0x80628bb0 arg 0xf82089867418 in 20613.042306cf 19009549 11 scheduled 0xf82089867728 func 0x80628bb0 arg 0xf82089867418 in 20563.0423409f 19009545 11 tcp_newtcpcb: 0xf82089867418 18962842 11 tcp_discardcb: free 0xf82089867418 18962830 11 tcp_discardcb: 0xf82089867418 draincnt 0 18962826 11 failed to stop 0xf820898677a8 func 0x80628740 arg 0xf82089867418 18962822 11 cancelled3 0xf82089867768 func 0x806288e0 arg 0xf82089867418 18962808 11 cancelled3 0xf82089867728 func 0x80628bb0 arg 0xf82089867418 18962804 11 failed to stop 0xf820898676e8 func 0x80628fa0 arg 0xf82089867418 18962792 11 failed to stop 0xf820898676a8 func 0x806291e0 arg 0xf82089867418 18962755 11 tcp_discardcb: 0xf82089867418 18962742 11 scheduled 0xf82089867768 func 0x806288e0 arg 0xf82089867418 in 20632.ffc8d308 18962703 11 cancelled3 0xf820898676a8 func 0x806291e0 arg 0xf82089867418 18962695 11 scheduled 0xf82089867728 func 0x80628bb0 arg 0xf82089867418 in 20612.ffc8ea28 18923275 6 tcp_timer_keep: 0xf82089867418 18923274 6 callout 0xf82089867728 func 0x80628bb0 arg 0xf82089867418 17850361 9 scheduled 0xf820898676a8 func 0x806291e0 arg 0xf82089867418 in 20553.5aec0004 -- Totus tuus, Glebius. ___ freebsd-current@freebsd.org mailing list https://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"
Re: panic with tcp timers
Hi Gleb, On 6/17/16 6:53 AM, Gleb Smirnoff wrote: > At Netflix we are observing a race in TCP timers with head. > The problem is a regression, that doesn't happen on stable/10. > The panic usually happens after several hours at 55 Gbit/s of > traffic. > > What happens is that tcp_timer_keep finds t_tcpcb being > NULL. Some coredumps have tcpcb already initialized, > with non-NULL t_tcpcb and in TCPS_ESTABLISHED state. Which > means that other CPU was working on the tcpcb while > the faulted one was working on the panic. So, this all looks > like a use after free, which conflicts with new allocation. > > Comparing stable/10 and head, I see two changes that could > affect that: > > - callout_async_drain > - switch to READ lock for inp info in tcp timers > > That's why you are in To, Julien and Hans :) > > We continue investigating, and I will keep you updated. > However, any help is welcome. I can share cores. Thanks for sharing. Let me run our TCP tests on a recent version of HEAD to see if by chance I can reproduce it. If I am not able to reproduce it I will ask for debug kernel and cores and see if I can help. Few notes here: - Around 2 months ago I did test HEAD with callout_async_drain() in TCP timers with our TCP QA testsuite but no kernel panic. That said I did not let our test run during several hours. - At Verisign we run 10 with READ lock for inp info in tcp timers change. Again, it does not mean this change has no impact here. My 2 cents. -- Julien signature.asc Description: OpenPGP digital signature
Re: panic with tcp timers
On 17 Jun 2016, at 4:53, Gleb Smirnoff wrote: Hi! At Netflix we are observing a race in TCP timers with head. The problem is a regression, that doesn't happen on stable/10. The panic usually happens after several hours at 55 Gbit/s of traffic. What happens is that tcp_timer_keep finds t_tcpcb being NULL. Some coredumps have tcpcb already initialized, with non-NULL t_tcpcb and in TCPS_ESTABLISHED state. Which means that other CPU was working on the tcpcb while the faulted one was working on the panic. So, this all looks like a use after free, which conflicts with new allocation. Comparing stable/10 and head, I see two changes that could affect that: - callout_async_drain - switch to READ lock for inp info in tcp timers That's why you are in To, Julien and Hans :) We continue investigating, and I will keep you updated. However, any help is welcome. I can share cores. There’s also the change to no longer mark the zones NO_FREE. In theory I was convinced at the time that it should not be an issue anymore. If I had overlooked something or follow-up timer changes invalidated assumptions then that could also be trouble. That said, I was not able to get any related panics or log entries anymore lately (but I am currently slightly behind head with my branch). We should get the problem fixed however and not try to “paint over” again. /bz ___ freebsd-current@freebsd.org mailing list https://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"
Re: panic with tcp timers
On 06/17/16 06:53, Gleb Smirnoff wrote: Hi! At Netflix we are observing a race in TCP timers with head. The problem is a regression, that doesn't happen on stable/10. The panic usually happens after several hours at 55 Gbit/s of traffic. What happens is that tcp_timer_keep finds t_tcpcb being NULL. Some coredumps have tcpcb already initialized, with non-NULL t_tcpcb and in TCPS_ESTABLISHED state. Which means that other CPU was working on the tcpcb while the faulted one was working on the panic. So, this all looks like a use after free, which conflicts with new allocation. Comparing stable/10 and head, I see two changes that could affect that: - callout_async_drain - switch to READ lock for inp info in tcp timers That's why you are in To, Julien and Hans :) We continue investigating, and I will keep you updated. However, any help is welcome. I can share cores. Hi, I do have projects/hps_head around, which is not that much behind 11-current, which has a completely different callout implementation. If you can reproduce the issue separately might we worth a try to rule out the callout stack. --HPS ___ freebsd-current@freebsd.org mailing list https://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"
panic with tcp timers
Hi! At Netflix we are observing a race in TCP timers with head. The problem is a regression, that doesn't happen on stable/10. The panic usually happens after several hours at 55 Gbit/s of traffic. What happens is that tcp_timer_keep finds t_tcpcb being NULL. Some coredumps have tcpcb already initialized, with non-NULL t_tcpcb and in TCPS_ESTABLISHED state. Which means that other CPU was working on the tcpcb while the faulted one was working on the panic. So, this all looks like a use after free, which conflicts with new allocation. Comparing stable/10 and head, I see two changes that could affect that: - callout_async_drain - switch to READ lock for inp info in tcp timers That's why you are in To, Julien and Hans :) We continue investigating, and I will keep you updated. However, any help is welcome. I can share cores. -- Totus tuus, Glebius. ___ freebsd-current@freebsd.org mailing list https://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"