On Mon, Jun 20, 2022 at 01:59:25PM +0200, Martin Pieuchot wrote:
> On 19/06/22(Sun) 11:34, Visa Hankala wrote:
> > On Fri, Jun 17, 2022 at 04:25:48PM +0300, Mikhail wrote:
> > > I was debugging tog in lldb and in second tmux window opened another
> > > bare tog instance, after a second I got this panic:
> > >
> > > panic: kernel diagnostic assetion "p->p_kq->kq_refcnt.r_refs == 1"
> > > failed file "/usr/src/sys/kern/kern_event.c", line 839
> > >
> > > There were also couple of xterms and chrome launched.
> > >
> > > There was an update of kern_event.c from 12 Jun - not sure if it's the
> > > fix for this panic or not.
> > >
> > > After the panic I updated to the latest snapshot, and can't reproduce it
> > > anymore, but maybe someone will have a clue.
> >
> > The 12 Jun kern_event.c commit is unrelated.
> >
> > This report shows no kernel stack trace, so I don't know if the panic
> > was caused by some unexpected thread exit path.
> >
> > However, it looks that there is a problem with kqueue_task(). Even
> > though the task holds a reference to the kqueue, the task should be
> > cleared before the kqueue is deleted. Otherwise, the kqueue's lifetime
> > can extend beyond that of the file descriptor table, causing
> > a use-after-free in KQRELE(). In addition, the task clearing should
> > avoid the unexpected reference count in kqpoll_exit().
>
> Nice catch.
>
> > The lifetime bug can be lured out by adding a brief sleep between
> > taskq_next_work() and (*work.t_func)(work.t_arg) in taskq_thread().
> > With the sleep in place, regress/sys/kern/kqueue causes the following
> > panic:
> >
> > panic: pool_do_get: fdescpl free list modified: page 0xfffffd811cf0e000;
> > item addr 0xfffffd811cf0e888; offset 0x48=0xdead4113
> > Stopped at db_enter+0x10: popq %rbp
> > TID PID UID PRFLAGS PFLAGS CPU COMMAND
> > *338246 40644 1001 0x100003 0 3K make
> > db_enter() at db_enter+0x10
> > panic(ffffffff81f841ac) at panic+0xbf
> > pool_do_get(ffffffff823c2fb8,9,ffff8000226dc904) at pool_do_get+0x35c
> > pool_get(ffffffff823c2fb8,9) at pool_get+0x96
> > fdcopy(ffff8000ffff13d0) at fdcopy+0x38
> > process_new(ffff8000ffff9500,ffff8000ffff13d0,1) at process_new+0x107
> > fork1(ffff8000ffff6d30,1,ffffffff81a6eab0,0,ffff8000226dcb00,0) at
> > fork1+0x236
> > syscall(ffff8000226dcb70) at syscall+0x374
> > Xsyscall() at Xsyscall+0x128
> > end of kernel
> > end trace frame: 0x7f7fffff28e0, count: 6
> >
> >
> > The following patch replaces the task_del() with taskq_del_barrier()
> > but also makes the barrier conditional to prior task usage. This avoids
> > the barrier invocation in the typical case where there is no kqueue
> > nesting (or poll(2)'ing or select(2)'ing of kqueues).
>
> One comment about this below.
>
> > The new barrier adds a lock order constraint. The locks that the thread
> > can hold when calling kqueue_terminate() should not be taken by tasks
> > that are run by systqmp. If this becomes a problem in the future,
> > kqueue can have its own taskq.
>
> Can this be enforced by some asserts or should we document it somewhere?
WITNESS detects locking errors with the barrier routine.
I will send a manual page patch.
> > Index: kern/kern_event.c
> > ===================================================================
> > RCS file: src/sys/kern/kern_event.c,v
> > retrieving revision 1.189
> > diff -u -p -r1.189 kern_event.c
> > --- kern/kern_event.c 12 Jun 2022 10:34:36 -0000 1.189
> > +++ kern/kern_event.c 19 Jun 2022 10:38:45 -0000
> > @@ -1581,6 +1581,7 @@ void
> > kqueue_terminate(struct proc *p, struct kqueue *kq)
> > {
> > struct knote *kn;
> > + int state;
> >
> > mtx_enter(&kq->kq_lock);
> >
> > @@ -1593,11 +1594,17 @@ kqueue_terminate(struct proc *p, struct
> > KASSERT(kn->kn_filter == EVFILT_MARKER);
> >
> > kq->kq_state |= KQ_DYING;
> > + state = kq->kq_state;
> > kqueue_wakeup(kq);
>
> Shouldn't we read `kq_state' after calling kqueue_wakeup()? Or are we
> sure this wakeup won't schedule a task?
The wakeup is there so that KQ_DYING takes effect. The task scheduling
should not happen because kq->kq_sel.si_note should be empty.
I can move the read after the wakeup call if that makes the code
clearer.
>
> > mtx_leave(&kq->kq_lock);
> >
> > + /*
> > + * Any knotes that were attached to this kqueue were deleted
> > + * by knote_fdclose() when this kqueue's file descriptor was closed.
> > + */
> > KASSERT(klist_empty(&kq->kq_sel.si_note));
> > - task_del(systqmp, &kq->kq_task);
> > + if (state & KQ_TASK)
> > + taskq_del_barrier(systqmp, &kq->kq_task);
> > }
> >
> > int
> > @@ -1623,7 +1630,6 @@ kqueue_task(void *arg)
> > mtx_enter(&kqueue_klist_lock);
> > KNOTE(&kq->kq_sel.si_note, 0);
> > mtx_leave(&kqueue_klist_lock);
> > - KQRELE(kq);
> > }
> >
> > void
> > @@ -1637,9 +1643,8 @@ kqueue_wakeup(struct kqueue *kq)
> > }
> > if (!klist_empty(&kq->kq_sel.si_note)) {
> > /* Defer activation to avoid recursion. */
> > - KQREF(kq);
> > - if (!task_add(systqmp, &kq->kq_task))
> > - KQRELE(kq);
> > + kq->kq_state |= KQ_TASK;
> > + task_add(systqmp, &kq->kq_task);
> > }
> > }
> >
> > Index: sys/eventvar.h
> > ===================================================================
> > RCS file: src/sys/sys/eventvar.h,v
> > retrieving revision 1.14
> > diff -u -p -r1.14 eventvar.h
> > --- sys/eventvar.h 16 Mar 2022 14:38:43 -0000 1.14
> > +++ sys/eventvar.h 19 Jun 2022 10:38:45 -0000
> > @@ -68,6 +68,7 @@ struct kqueue {
> > #define KQ_SEL 0x01
> > #define KQ_SLEEP 0x02
> > #define KQ_DYING 0x04
> > +#define KQ_TASK 0x08
> > };
> >
> > #endif /* !_SYS_EVENTVAR_H_ */
> >
>