On 30/04/19(Tue) 08:06, David Gwynne wrote: > > > > On 30 Apr 2019, at 03:24, Martin Pieuchot <[email protected]> wrote: > > > > On 29/04/19(Mon) 17:24, David Gwynne wrote: > >> On Sun, Apr 28, 2019 at 06:57:02PM -0300, Martin Pieuchot wrote: > >>> On 23/04/19(Tue) 12:16, Olivier Antoine wrote: > >>>>> Synopsis: panic: Stopped at kqueue_scan > >>>>> Category: kernel i386 > >>>>> Environment: > >>>> System : OpenBSD 6.5 > >>>> Details : OpenBSD 6.5-current (GENERIC.MP) #1368: Sun Apr 21 > >>>> 19:50:46 MDT 2019 > >>>> > >>>> [email protected]:/usr/src/sys/arch/i386/compile/GENERIC.MP > >>>> > >>>> Architecture: OpenBSD.i386 > >>>> Machine : i386 > >>>>> Description: > >>>> Hi, since my last update I have regular panic crashes. 4 in two days. > >>>> At least 3 of them, with certainty, occurred while I was accessing the > >>>> Internet via my smartphone connected to my OpenBSD WiFi access point > >>>> through my Allways-on VPN isakmp/ipsec/nppp relaying traffic in Tor. > >>>> This setup works for years. > >>>> > >>>> The machine then displays something like: > >>>> uvm_fault(0xd34e5f3c, 0x0, 0, 2) -> e > >>>> kernel: page fault trap, code=0 > >>>> Stopped at kqueue_scan+0x246: movl %eax,0(%ecx) > >>>> ddb{1}> > >>> > >>> So this indicates that the `kqueue' is empty. It should not happen > >>> because the caller, in your case npppd, always places a marker in the > >>> list. > >>> > >>> Since the caller is not threaded and the syscall is executed with the > >>> KERNEL_LOCK() held, we can supposed that another part of the kernel is > >>> removing the marker. That would imply that the other part isn't running > >>> with the KERNEL_LOCK() and requires a MP kernel. > >>> > >>> Could you try *very hard* to reproduce the problem with a kernel built > >>> with the diff below? Hopefully you'll make it crash and we'll find the > >>> bug. Otherwise we'll look for another possible cause of the marker > >>> removal. > >>> > >>> Index: kern/kern_event.c > >>> =================================================================== > >>> RCS file: /cvs/src/sys/kern/kern_event.c,v > >>> retrieving revision 1.101 > >>> diff -u -p -r1.101 kern_event.c > >>> --- kern/kern_event.c 27 Nov 2018 15:52:50 -0000 1.101 > >>> +++ kern/kern_event.c 28 Apr 2019 21:47:25 -0000 > >>> @@ -1052,6 +1052,8 @@ knote_drop(struct knote *kn, struct proc > >>> struct kqueue *kq = kn->kn_kq; > >>> struct klist *list; > >>> > >>> + KERNEL_ASSERT_LOCKED(); > >>> + > >>> if (kn->kn_fop->f_isfd) > >>> list = &kq->kq_knlist[kn->kn_id]; > >>> else > >>> > >> > >> i had a similar diff in my tree, and with some clues from this thread > >> and the same panic from jmc@, points out that tun(4) calls tun_wakeup > >> without KERNEL_LOCK. tun_wakeup calls selwakeup, which ends up in > >> the kq code messing up the kn_head list. fun fun. > >> > >> below is an extremely clever diff to tun to avoid doing the wakeup > >> without the kernel lock held. i say extremely clever because tun(4) is > >> not marked as MPSAFE, which means the if_start handler gets called with > >> KERNEL_LOCK taken by the stack on its behalf. tun_start then calls > >> tun_wakeup with that implicit KERNEL_LOCK hold. > >> > >> i don't know why this hasn't blown up before. the network stack > >> hasn't been run with the KERNEL_LOCK for ages now. > >> > >> tun_output with a custom if_enqueue handler would be a lot smarter, > >> but that is a more invasive diff. > > > > I'd prefer if we could grab the KERNEK_LOCK() around csignal() and > > selwakeup() like it is done in the socket code. I believe that we > > need to show where the lock is taken in order to help people realize > > where it needs to be pushed down. > > I should have been more clear, but this diff is to fix the tree, it's by no > means a final fix. tun(4) itself could do with some further changes, > including the ones you describe above. > > If you prefer I could just add KERNEL_LOCK to tun_wakeup?
I'd prefer :) > > Can we solve both issues with a similar fix? How hard can it be to get > > selwakeup() out of the KERNEL_LOCK()? > > Figuring that out was going to go onto my todo. \o/ > > Should we add a KASSERT in csignal() too? > > If it currently needs the kernel lock then it should have the assert for it. > Otherwise we have issues like these waiting days or weeks for fixes. Go for it.
