On 30/04/19(Tue) 08:06, David Gwynne wrote:
> 
> 
> > On 30 Apr 2019, at 03:24, Martin Pieuchot <[email protected]> wrote:
> > 
> > On 29/04/19(Mon) 17:24, David Gwynne wrote:
> >> On Sun, Apr 28, 2019 at 06:57:02PM -0300, Martin Pieuchot wrote:
> >>> On 23/04/19(Tue) 12:16, Olivier Antoine wrote:
> >>>>> Synopsis:    panic: Stopped at kqueue_scan
> >>>>> Category:    kernel i386
> >>>>> Environment:
> >>>>    System      : OpenBSD 6.5
> >>>>    Details     : OpenBSD 6.5-current (GENERIC.MP) #1368: Sun Apr 21
> >>>> 19:50:46 MDT 2019
> >>>>             
> >>>> [email protected]:/usr/src/sys/arch/i386/compile/GENERIC.MP
> >>>> 
> >>>>    Architecture: OpenBSD.i386
> >>>>    Machine     : i386
> >>>>> Description:
> >>>> Hi, since my last update I have regular panic crashes. 4 in two days.
> >>>> At least 3 of them, with certainty, occurred while I was accessing the
> >>>> Internet via my smartphone connected to my OpenBSD WiFi access point
> >>>> through my Allways-on VPN isakmp/ipsec/nppp relaying traffic in Tor.
> >>>> This setup works for years.
> >>>> 
> >>>> The machine then displays something like:
> >>>> uvm_fault(0xd34e5f3c, 0x0, 0, 2) -> e
> >>>> kernel: page fault trap, code=0
> >>>> Stopped at kqueue_scan+0x246: movl %eax,0(%ecx)
> >>>> ddb{1}>
> >>> 
> >>> So this indicates that the `kqueue' is empty.  It should not happen
> >>> because the caller, in your case npppd, always places a marker in the
> >>> list.
> >>> 
> >>> Since the caller is not threaded and the syscall is executed with the
> >>> KERNEL_LOCK() held, we can supposed that another part of the kernel is
> >>> removing the marker.  That would imply that the other part isn't running
> >>> with the KERNEL_LOCK() and requires a MP kernel.
> >>> 
> >>> Could you try *very hard* to reproduce the problem with a kernel built
> >>> with the diff below?  Hopefully you'll make it crash and we'll find the
> >>> bug.  Otherwise we'll look for another possible cause of the marker
> >>> removal.
> >>> 
> >>> Index: kern/kern_event.c
> >>> ===================================================================
> >>> RCS file: /cvs/src/sys/kern/kern_event.c,v
> >>> retrieving revision 1.101
> >>> diff -u -p -r1.101 kern_event.c
> >>> --- kern/kern_event.c     27 Nov 2018 15:52:50 -0000      1.101
> >>> +++ kern/kern_event.c     28 Apr 2019 21:47:25 -0000
> >>> @@ -1052,6 +1052,8 @@ knote_drop(struct knote *kn, struct proc
> >>>   struct kqueue *kq = kn->kn_kq;
> >>>   struct klist *list;
> >>> 
> >>> + KERNEL_ASSERT_LOCKED();
> >>> +
> >>>   if (kn->kn_fop->f_isfd)
> >>>           list = &kq->kq_knlist[kn->kn_id];
> >>>   else
> >>> 
> >> 
> >> i had a similar diff in my tree, and with some clues from this thread
> >> and the same panic from jmc@, points out that tun(4) calls tun_wakeup
> >> without KERNEL_LOCK. tun_wakeup calls selwakeup, which ends up in
> >> the kq code messing up the kn_head list. fun fun.
> >> 
> >> below is an extremely clever diff to tun to avoid doing the wakeup
> >> without the kernel lock held. i say extremely clever because tun(4) is
> >> not marked as MPSAFE, which means the if_start handler gets called with
> >> KERNEL_LOCK taken by the stack on its behalf. tun_start then calls
> >> tun_wakeup with that implicit KERNEL_LOCK hold.
> >> 
> >> i don't know why this hasn't blown up before. the network stack
> >> hasn't been run with the KERNEL_LOCK for ages now.
> >> 
> >> tun_output with a custom if_enqueue handler would be a lot smarter,
> >> but that is a more invasive diff.
> > 
> > I'd prefer if we could grab the KERNEK_LOCK() around csignal() and
> > selwakeup() like it is done in the socket code.  I believe that we
> > need to show where the lock is taken in order to help people realize
> > where it needs to be pushed down.
> 
> I should have been more clear, but this diff is to fix the tree, it's by no 
> means a final fix. tun(4) itself could do with some further changes, 
> including the ones you describe above.
> 
> If you prefer I could just add KERNEL_LOCK to tun_wakeup?

I'd prefer :)

> > Can we solve both issues with a similar fix?  How hard can it be to get
> > selwakeup() out of the KERNEL_LOCK()?
> 
> Figuring that out was going to go onto my todo.

\o/
 
> > Should we add a KASSERT in csignal() too?
> 
> If it currently needs the kernel lock then it should have the assert for it. 
> Otherwise we have issues like these waiting days or weeks for fixes.

Go for it.

Reply via email to