On 22/02/24(Thu) 17:24, Claudio Jeker wrote:
> On Thu, Feb 22, 2024 at 04:16:57PM +0100, Martin Pieuchot wrote:
> > On 21/02/24(Wed) 13:05, Claudio Jeker wrote:
> > > On Tue, Feb 20, 2024 at 09:34:12PM +0100, Martin Pieuchot wrote:
> > > > On 28/10/21(Thu) 05:45, Visa Hankala wrote:
> > > > > On Wed, Oct 27, 2021 at 09:02:08PM -0400, Dave Voutila wrote:
> > > > > > Dave Voutila <[email protected]> writes:
> > > > > >
> > > > > > > I was tinkering on a bt(5) script to debug an issue in vmm(4)
> > > > > > > when I managed to start hitting a panic "wakeup: p_stat is 2"
> > > > > > > triggered by kqueue coming from the softnet kernel task.
> > > > > > >
> > > > > > > I'm running an amd64 kernel built from the tree today (latest CVS
> > > > > > > commit id UynQo1r7kLKA0Q2p) with the VMM_DEBUG option set and the
> > > > > > > defaults from GENERIC.MP. Userland is from the latest amd64 snap.
> > > > > > >
> > > > > > > To reproduce, I'm running a single OpenBSD-current guest under
> > > > > > > vmd(8) and targeting it with the following trivial btrace script I
> > > > > > > was working on to debug something in vmm(4):
> > > > > > >
> > > > > > > tracepoint:sched:sleep / pid == $1 && tid == $2 / {
> > > > > > > 	printf("pid %d, tid %d slept %d!\n", pid, tid, nsecs);
> > > > > > > }
> > > > > > >
> > > > > > > tracepoint:sched:wakeup / pid == $1 && tid == $2 / {
> > > > > > > 	printf("pid %d, tid %d awoke %d!\n", pid, tid, nsecs);
> > > > > > > }
> > > > > >
> > > > > > Even easier reproduction: if you have 2 machines and can use
> > > > > > tcpbench(1) between them, then while tcpbench is running target it
> > > > > > with the above btrace script. I've found running the script, killing
> > > > > > it with ctrl-c, and re-running it 2-3 times triggers the panic on my
> > > > > > laptop.
> > > > > >
> > > > > > >
> > > > > > > Both times this happened I was trying to sysupgrade the vmd(8)
> > > > > > > guest while running the above btrace script. When I don't run the
> > > > > > > script, there is no panic.
> > > > > > >
> > > > > > > Image of the full backtrace is here: https://imgur.com/a/swW1qoj
> > > > > > >
> > > > > > > Simple transcript of the call stack after the panic() call looks
> > > > > > > like:
> > > > > > >
> > > > > > > wakeup_n
> > > > > > > kqueue_wakeup
> > > > > > > knote
> > > > > > > selwakeup
> > > > > > > tun_enqueue
> > > > > > > ether_output
> > > > > > > ip_output
> > > > > > > ip_forward
> > > > > > > ip_input_if
> > > > > > > ipv4_input
> > > > > > > ether_input
> > > > > > > if_input_process
> > > > > > >
> > > > > > > The other 3 cpu cores appeared to be in IPI handlers. (Image in
> > > > > > > that imgur link)
> > > > >
> > > > > I think the problem is recursion within the scheduler. Trace points
> > > > > invoke wakeup() directly, which is unsafe when done from within the
> > > > > run queue routines.
> > > > >
> > > > > One approach to avoid the recursion is to defer the wakeup() with
> > > > > a soft interrupt. The scheduler runs at an elevated system priority
> > > > > level that blocks soft interrupts. The deferred wakeup() is issued
> > > > > once the system priority level drops low enough.
> > > > >
> > > > > Unfortunately, the patch will get broken when someone adds trace
> > > > > points to soft interrupt scheduling...
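Side note, for the archive: the deferral Visa describes above would look
roughly like the sketch below. This is only an illustration of the idea,
not his actual patch; the dtwake_* names are made up, the soft interrupt
still has to be set up once with softintr_establish(), and it needs
<sys/mutex.h>, <sys/queue.h> and <machine/intr.h>.

struct dtwake_entry {
	SIMPLEQ_ENTRY(dtwake_entry)	 dw_next;
	const volatile void		*dw_ident;
};

SIMPLEQ_HEAD(, dtwake_entry) dtwake_list =
    SIMPLEQ_HEAD_INITIALIZER(dtwake_list);
struct mutex dtwake_mtx = MUTEX_INITIALIZER(IPL_HIGH);
void *dtwake_si;		/* cookie from softintr_establish() */

/*
 * Called from probe context instead of wakeup(): only record the ident
 * and schedule the soft interrupt, never touch the sleep queues here.
 */
void
dtwake_defer(struct dtwake_entry *dw, const volatile void *ident)
{
	dw->dw_ident = ident;
	mtx_enter(&dtwake_mtx);
	SIMPLEQ_INSERT_TAIL(&dtwake_list, dw, dw_next);
	mtx_leave(&dtwake_mtx);
	softintr_schedule(dtwake_si);
}

/*
 * Soft interrupt handler: only runs once the SPL has dropped below the
 * level the scheduler holds, so calling wakeup() is safe again here.
 */
void
dtwake_softintr(void *arg)
{
	struct dtwake_entry *dw;

	mtx_enter(&dtwake_mtx);
	while ((dw = SIMPLEQ_FIRST(&dtwake_list)) != NULL) {
		SIMPLEQ_REMOVE_HEAD(&dtwake_list, dw_next);
		mtx_leave(&dtwake_mtx);
		wakeup(dw->dw_ident);
		mtx_enter(&dtwake_mtx);
	}
	mtx_leave(&dtwake_mtx);
}

As Visa notes, this falls apart as soon as the soft interrupt code itself
grows tracepoints.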
> > > >
> > > > Thanks for the report. Sorry for the delay. I'm now really interested
> > > > in fixing this bug because I'm heavily using btrace(8) to analyse the
> > > > scheduler and I hit this panic a couple of times.
> > > >
> > > > The problem is that `pnext' might no longer be on the sleepqueue after a
> > > > tracepoint inside setrunnable() fired. Diff below fixes that by making
> > > > wakeup_n() re-entrant.
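To make the window explicit, here is the current first loop of wakeup_n()
with the problem annotated; the code matches the context lines of the diff,
only the comments are mine:

	for (p = TAILQ_FIRST(qp); p != NULL && n != 0; p = pnext) {
		/* Snapshot of the next sleeper, taken before waking `p'. */
		pnext = TAILQ_NEXT(p, p_runq);
#ifdef DIAGNOSTIC
		if (p->p_stat != SSLEEP && p->p_stat != SSTOP)
			panic("wakeup: p_stat is %d", (int)p->p_stat);
#endif
		/*
		 * wakeup_proc() may call setrunnable(), whose tracepoint can
		 * end up calling wakeup() again, e.g. to notify the btrace(8)
		 * reader.  If that nested wakeup makes `pnext' runnable, the
		 * next iteration finds it with p_stat == SRUN (2) and the
		 * DIAGNOSTIC check above panics.
		 */
		if (wakeup_proc(p, ident, 0))
			--n;
	}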
> > > >
> > > > I'm not very interested in committing this diff because it relies on a
> > > > recursive SCHED_LOCK(). Instead I'd prefer to split wakeup_n() into two
> > > > stages: first unlink the threads, then call setrunnable(). This approach
> > > > will help us untangle the sleepqueue but needs a bit more shuffling,
> > > > like moving unsleep() out of setrunnable()...
> > > >
> > > > Claudio, Visa do you agree?
> > >
> > > [...]
> > >
> > > Ugh, this is too much magic for my little brain.
> > > I would prefer to unqueue the procs onto a new local list and have a
> > > special wakeup_proc for them. The code around wakeup and unsleep will need
> > > some changes to unlock it properly.
> >
> > Yes, that is what I had in mind. Diff below does that and works for
> > me(tm).
> >
> > The first loop removes threads from the sleep queue and the second puts
> > them on a runqueue and optionally fires tracepoints.
> >
> > Note that unsleep() has been removed from setrunnable() to reduce the
> > complexity of the code. Having setrunnable() sometimes touch the state of
> > the sleepqueue down in this path doesn't help me get a simpler picture of
> > this machinery.
> >
> > ok?
>
> I like this a lot. I think this is going in the right direction.
> Diff looks good, but I only had a quick look at it.
> Untangling unsleep() from setrunnable() is a good thing (I hope it works).
I'd like to move forward with this. Any ok?
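For reference, this is how wakeup_n() reads with the diff applied
(reassembled from the hunks below, comments added by me):

void
wakeup_n(const volatile void *ident, int n)
{
	struct slpque *qp, wakeq;
	struct proc *p;
	struct proc *pnext;
	int s;

	TAILQ_INIT(&wakeq);

	SCHED_LOCK(s);
	qp = &slpque[LOOKUP(ident)];
	/* 1st stage: unlink up to n matching threads from the sleep queue. */
	for (p = TAILQ_FIRST(qp); p != NULL && n != 0; p = pnext) {
		pnext = TAILQ_NEXT(p, p_runq);
#ifdef DIAGNOSTIC
		if (p->p_stat != SSLEEP && p->p_stat != SSTOP)
			panic("thread %d p_stat is %d", p->p_tid, p->p_stat);
#endif
		KASSERT(p->p_wchan != NULL);
		if (p->p_wchan == ident) {
			TAILQ_REMOVE(qp, p, p_runq);
			p->p_wchan = NULL;
			TAILQ_INSERT_TAIL(&wakeq, p, p_runq);
			--n;
		}
	}
	/* 2nd stage: fire tracepoints and make the SSLEEP threads runnable. */
	while ((p = TAILQ_FIRST(&wakeq))) {
		TAILQ_REMOVE(&wakeq, p, p_runq);
		TRACEPOINT(sched, unsleep, p->p_tid + THREAD_PID_OFFSET,
		    p->p_p->ps_pid);
		if (p->p_stat == SSLEEP)
			setrunnable(p);
	}
	SCHED_UNLOCK(s);
}

Any tracepoint fired from the second loop now sees the woken threads already
unlinked from the sleep queue, so a nested wakeup() can no longer disturb the
iteration.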
> > Index: dev/pci/drm/drm_linux.c
> > ===================================================================
> > RCS file: /cvs/src/sys/dev/pci/drm/drm_linux.c,v
> > retrieving revision 1.109
> > diff -u -p -r1.109 drm_linux.c
> > --- dev/pci/drm/drm_linux.c 21 Jan 2024 13:36:40 -0000 1.109
> > +++ dev/pci/drm/drm_linux.c 22 Feb 2024 14:32:27 -0000
> > @@ -162,7 +162,7 @@ wake_up_process(struct proc *p)
> > int s, rv;
> >
> > SCHED_LOCK(s);
> > - rv = wakeup_proc(p, NULL, 0);
> > + rv = wakeup_proc(p, 0);
> > SCHED_UNLOCK(s);
> > return rv;
> > }
> > Index: kern/kern_sig.c
> > ===================================================================
> > RCS file: /cvs/src/sys/kern/kern_sig.c,v
> > retrieving revision 1.321
> > diff -u -p -r1.321 kern_sig.c
> > --- kern/kern_sig.c 17 Jan 2024 22:22:25 -0000 1.321
> > +++ kern/kern_sig.c 22 Feb 2024 14:25:16 -0000
> > @@ -1200,6 +1200,7 @@ runfast:
> > if (p->p_usrpri > PUSER)
> > p->p_usrpri = PUSER;
> > run:
> > + unsleep(p);
> > setrunnable(p);
> > out:
> > /* finally adjust siglist */
> > @@ -2104,6 +2105,7 @@ single_thread_set(struct proc *p, int fl
> > if (q->p_flag & P_WEXIT) {
> > if (mode == SINGLE_EXIT) {
> > if (q->p_stat == SSTOP) {
> > + unsleep(q);
> > setrunnable(q);
> > atomic_inc_int(&pr->ps_singlecount);
> > }
> > @@ -2125,12 +2127,14 @@ single_thread_set(struct proc *p, int fl
> > break;
> > }
> > /* need to unwind or exit, so wake it */
> > + unsleep(q);
> > setrunnable(q);
> > }
> > atomic_inc_int(&pr->ps_singlecount);
> > break;
> > case SSTOP:
> > if (mode == SINGLE_EXIT) {
> > + unsleep(q);
> > setrunnable(q);
> > atomic_inc_int(&pr->ps_singlecount);
> > }
> > @@ -2198,9 +2202,10 @@ single_thread_clear(struct proc *p, int
> > * it back into some sleep queue
> > */
> > if (q->p_stat == SSTOP && (q->p_flag & flag) == 0) {
> > - if (q->p_wchan == NULL)
> > + if (q->p_wchan == NULL) {
> > + unsleep(q);
> > setrunnable(q);
> > - else {
> > + } else {
> > atomic_clearbits_int(&q->p_flag, P_WSLEEP);
> > q->p_stat = SSLEEP;
> > }
> > Index: kern/kern_synch.c
> > ===================================================================
> > RCS file: /cvs/src/sys/kern/kern_synch.c,v
> > retrieving revision 1.200
> > diff -u -p -r1.200 kern_synch.c
> > --- kern/kern_synch.c 13 Sep 2023 14:25:49 -0000 1.200
> > +++ kern/kern_synch.c 22 Feb 2024 14:55:05 -0000
> > @@ -467,25 +467,23 @@ sleep_signal_check(void)
> > }
> >
> > int
> > -wakeup_proc(struct proc *p, const volatile void *chan, int flags)
> > +wakeup_proc(struct proc *p, int flags)
> > {
> > int awakened = 0;
> >
> > SCHED_ASSERT_LOCKED();
> >
> > - if (p->p_wchan != NULL &&
> > - ((chan == NULL) || (p->p_wchan == chan))) {
> > + if (p->p_wchan != NULL) {
> > awakened = 1;
> > if (flags)
> > atomic_setbits_int(&p->p_flag, flags);
> > - if (p->p_stat == SSLEEP)
> > - setrunnable(p);
> > - else if (p->p_stat == SSTOP)
> > - unsleep(p);
> > #ifdef DIAGNOSTIC
> > - else
> > - panic("wakeup: p_stat is %d", (int)p->p_stat);
> > + if (p->p_stat != SSLEEP && p->p_stat != SSTOP)
> > + panic("thread %d p_stat is %d", p->p_tid, p->p_stat);
> > #endif
> > + unsleep(p);
> > + if (p->p_stat == SSLEEP)
> > + setrunnable(p);
> > }
> >
> > return awakened;
> > @@ -505,7 +503,7 @@ endtsleep(void *arg)
> > int s;
> >
> > SCHED_LOCK(s);
> > - wakeup_proc(p, NULL, P_TIMEOUT);
> > + wakeup_proc(p, P_TIMEOUT);
> > SCHED_UNLOCK(s);
> > }
> >
> > @@ -531,21 +529,35 @@ unsleep(struct proc *p)
> > void
> > wakeup_n(const volatile void *ident, int n)
> > {
> > - struct slpque *qp;
> > + struct slpque *qp, wakeq;
> > struct proc *p;
> > struct proc *pnext;
> > int s;
> >
> > + TAILQ_INIT(&wakeq);
> > +
> > SCHED_LOCK(s);
> > qp = &slpque[LOOKUP(ident)];
> > for (p = TAILQ_FIRST(qp); p != NULL && n != 0; p = pnext) {
> > pnext = TAILQ_NEXT(p, p_runq);
> > #ifdef DIAGNOSTIC
> > if (p->p_stat != SSLEEP && p->p_stat != SSTOP)
> > - panic("wakeup: p_stat is %d", (int)p->p_stat);
> > + panic("thread %d p_stat is %d", p->p_tid, p->p_stat);
> > #endif
> > - if (wakeup_proc(p, ident, 0))
> > + KASSERT(p->p_wchan != NULL);
> > + if (p->p_wchan == ident) {
> > + TAILQ_REMOVE(qp, p, p_runq);
> > + p->p_wchan = NULL;
> > + TAILQ_INSERT_TAIL(&wakeq, p, p_runq);
> > --n;
> > + }
> > + }
> > + while ((p = TAILQ_FIRST(&wakeq))) {
> > + TAILQ_REMOVE(&wakeq, p, p_runq);
> > + TRACEPOINT(sched, unsleep, p->p_tid + THREAD_PID_OFFSET,
> > + p->p_p->ps_pid);
> > + if (p->p_stat == SSLEEP)
> > + setrunnable(p);
> > }
> > SCHED_UNLOCK(s);
> > }
> > Index: kern/sched_bsd.c
> > ===================================================================
> > RCS file: /cvs/src/sys/kern/sched_bsd.c,v
> > retrieving revision 1.90
> > diff -u -p -r1.90 sched_bsd.c
> > --- kern/sched_bsd.c 24 Jan 2024 19:23:38 -0000 1.90
> > +++ kern/sched_bsd.c 22 Feb 2024 14:24:31 -0000
> > @@ -500,12 +500,10 @@ setrunnable(struct proc *p)
> > if ((pr->ps_flags & PS_TRACED) != 0 && pr->ps_xsig != 0)
> > atomic_setbits_int(&p->p_siglist, sigmask(pr->ps_xsig));
> > prio = p->p_usrpri;
> > - unsleep(p);
> > setrunqueue(NULL, p, prio);
> > break;
> > case SSLEEP:
> > prio = p->p_slppri;
> > - unsleep(p); /* e.g. when sending signals */
> >
> > /* if not yet asleep, don't add to runqueue */
> > if (ISSET(p->p_flag, P_WSLEEP))
> > Index: kern/sys_process.c
> > ===================================================================
> > RCS file: /cvs/src/sys/kern/sys_process.c,v
> > retrieving revision 1.95
> > diff -u -p -r1.95 sys_process.c
> > --- kern/sys_process.c 21 Nov 2023 14:00:13 -0000 1.95
> > +++ kern/sys_process.c 22 Feb 2024 14:25:34 -0000
> > @@ -493,6 +493,7 @@ ptrace_ctrl(struct proc *p, int req, pid
> > if (t->p_stat == SSTOP) {
> > tr->ps_xsig = data;
> > SCHED_LOCK(s);
> > + unsleep(t);
> > setrunnable(t);
> > SCHED_UNLOCK(s);
> > } else {
> > Index: sys/proc.h
> > ===================================================================
> > RCS file: /cvs/src/sys/sys/proc.h,v
> > retrieving revision 1.356
> > diff -u -p -r1.356 proc.h
> > --- sys/proc.h 3 Feb 2024 18:51:58 -0000 1.356
> > +++ sys/proc.h 22 Feb 2024 14:32:13 -0000
> > @@ -555,7 +555,7 @@ void procinit(void);
> > void setpriority(struct proc *, uint32_t, uint8_t);
> > void setrunnable(struct proc *);
> > void endtsleep(void *);
> > -int wakeup_proc(struct proc *, const volatile void *, int);
> > +int wakeup_proc(struct proc *, int);
> > void unsleep(struct proc *);
> > void reaper(void *);
> > __dead void exit1(struct proc *, int, int, int);
>
> --
> :wq Claudio
>