On Thu, Feb 22, 2024 at 04:16:57PM +0100, Martin Pieuchot wrote:
> On 21/02/24(Wed) 13:05, Claudio Jeker wrote:
> > On Tue, Feb 20, 2024 at 09:34:12PM +0100, Martin Pieuchot wrote:
> > > On 28/10/21(Thu) 05:45, Visa Hankala wrote:
> > > > On Wed, Oct 27, 2021 at 09:02:08PM -0400, Dave Voutila wrote:
> > > > > Dave Voutila <[email protected]> writes:
> > > > >
> > > > > > Was tinkering on a bt(5) script for trying to debug an issue in
> > > > > > vmm(4)
> > > > > > when I managed to start hitting a panic "wakeup: p_stat is 2" being
> > > > > > triggered by kqueue coming from the softnet kernel task.
> > > > > >
> > > > > > I'm running an amd64 kernel built from the tree today (latest CVS
> > > > > > commit
> > > > > > id UynQo1r7kLKA0Q2p) with VMM_DEBUG option set and the defaults from
> > > > > > GENERIC.MP. Userland is from the latest amd snap.
> > > > > >
> > > > > > To reproduce, I'm running a single OpenBSD-current guest under
> > > > > > vmd(8)
> > > > > > which I'm targeting with the following trivial btrace script I was
> > > > > > working on to use for debugging something in vmm(4):
> > > > > >
> > > > > > tracepoint:sched:sleep / pid == $1 && tid == $2 /{
> > > > > > printf("pid %d, tid %d slept %d!\n", pid, tid, nsecs);
> > > > > > }
> > > > > >
> > > > > > tracepoint:sched:wakeup / pid == $1 && tid == $2 /{
> > > > > > printf("pid %d, tid %d awoke %d!\n", pid, tid, nsecs);
> > > > > > }
> > > > >
> > > > > Even easier reproduction: if you have 2 machines and can use
> > > > > tcpbench(1)
> > > > > between them, then while tcpbench is running target it with the above
> > > > > btrace script. I've found running the script, killing it with ctrl-c,
> > > > > and re-running it 2-3 times triggers the panic on my laptop.
> > > > >
> > > > > >
> > > > > > Both times this happened I was trying to sysupgrade the vmd(8) guest
> > > > > > while running the above btrace script. When I don't run the script,
> > > > > > there is no panic.
> > > > > >
> > > > > > Image of the full backtrace is here: https://imgur.com/a/swW1qoj
> > > > > >
> > > > > > Simple transcript of the call stack after the panic() call looks
> > > > > > like:
> > > > > >
> > > > > > wakeup_n
> > > > > > kqueue_wakeup
> > > > > > knote
> > > > > > selwakekup
> > > > > > tun_enqueue
> > > > > > ether_output
> > > > > > ip_output
> > > > > > ip_forward
> > > > > > ip_input_if
> > > > > > ipv4_input
> > > > > > ether_input
> > > > > > if_input_process
> > > > > >
> > > > > > The other 3 cpu cores appeared to be in ipi handlers. (Image in that
> > > > > > imgur link)
> > > >
> > > > I think the problem is recursion within the scheduler. Trace points
> > > > invoke wakeup() directly, which is unsafe when done from within the
> > > > run queue routines.
> > > >
> > > > One approach to avoid the recursion is to defer the wakeup() with
> > > > a soft interrupt. The scheduler runs at an elevated system priority
> > > > level that blocks soft interrupts. The deferred wakeup() is issued once
> > > > the system priority level turns low enough.
> > > >
> > > > Unfortunately, the patch will get broken when someone adds trace points
> > > > to soft interrupt scheduling...
> > >
> > > Thanks for the report. Sorry for the delay. I'm now really interested
> > > in fixing this bug because I'm heavily using btrace(8) to analyse the
> > > scheduler and I hit this panic a couple of times.
> > >
> > > The problem is that `pnext' might no longer be on the sleepqueue after a
> > > tracepoint inside setrunnable() fired. Diff below fixes that by making
> > > wakeup_n() re-entrant.
> > >
> > > I'm not very interested in committing this diff because it relies on a
> > > recursive SCHED_LOCK(). Instead I'd prefer to split wakeup_n() in two
> > > stages: first unlink the threads then call setrunnable(). This approach
> > > will help us untangle the sleepqueue but needs a bit more shuffling,
> > > like moving unsleep() out of setrunnable()...
> > >
> > > Claudio, Visa do you agree?
> >
> > [...]
> >
> > Ugh, this is too much magic for my little brain.
> > I would prefer to unqueue the procs onto a new local list and have a
> > special wakeup_proc for them. The code around wakeup and unsleep will need
> > some changes to unlock it properly.
>
> Yes, that is what I had in mind. Diff below does that and works for
> me(tm).
>
> The first loop removes threads from the sleep queue and the second put
> them on a runqueue and optionally call tracepoints.
>
> Note that unsleep() has been removed from setrunnable() to reduce the
> complexity of the code. Messing, sometimes, with the state of the
> sleepqueue down in this path doesn't help me get a simpler picture of
> this machinery.
>
> ok?
I like this a lot. I think this is going in the right direction.
Diff looks good but I only quickly looked at it.
Untangling unsleep() from setrunnable() is a good thing (I hope it works).
> Index: dev/pci/drm/drm_linux.c
> ===================================================================
> RCS file: /cvs/src/sys/dev/pci/drm/drm_linux.c,v
> retrieving revision 1.109
> diff -u -p -r1.109 drm_linux.c
> --- dev/pci/drm/drm_linux.c 21 Jan 2024 13:36:40 -0000 1.109
> +++ dev/pci/drm/drm_linux.c 22 Feb 2024 14:32:27 -0000
> @@ -162,7 +162,7 @@ wake_up_process(struct proc *p)
> int s, rv;
>
> SCHED_LOCK(s);
> - rv = wakeup_proc(p, NULL, 0);
> + rv = wakeup_proc(p, 0);
> SCHED_UNLOCK(s);
> return rv;
> }
> Index: kern/kern_sig.c
> ===================================================================
> RCS file: /cvs/src/sys/kern/kern_sig.c,v
> retrieving revision 1.321
> diff -u -p -r1.321 kern_sig.c
> --- kern/kern_sig.c 17 Jan 2024 22:22:25 -0000 1.321
> +++ kern/kern_sig.c 22 Feb 2024 14:25:16 -0000
> @@ -1200,6 +1200,7 @@ runfast:
> if (p->p_usrpri > PUSER)
> p->p_usrpri = PUSER;
> run:
> + unsleep(p);
> setrunnable(p);
> out:
> /* finally adjust siglist */
> @@ -2104,6 +2105,7 @@ single_thread_set(struct proc *p, int fl
> if (q->p_flag & P_WEXIT) {
> if (mode == SINGLE_EXIT) {
> if (q->p_stat == SSTOP) {
> + unsleep(q);
> setrunnable(q);
> atomic_inc_int(&pr->ps_singlecount);
> }
> @@ -2125,12 +2127,14 @@ single_thread_set(struct proc *p, int fl
> break;
> }
> /* need to unwind or exit, so wake it */
> + unsleep(q);
> setrunnable(q);
> }
> atomic_inc_int(&pr->ps_singlecount);
> break;
> case SSTOP:
> if (mode == SINGLE_EXIT) {
> + unsleep(q);
> setrunnable(q);
> atomic_inc_int(&pr->ps_singlecount);
> }
> @@ -2198,9 +2202,10 @@ single_thread_clear(struct proc *p, int
> * it back into some sleep queue
> */
> if (q->p_stat == SSTOP && (q->p_flag & flag) == 0) {
> - if (q->p_wchan == NULL)
> + if (q->p_wchan == NULL) {
> + unsleep(q);
> setrunnable(q);
> - else {
> + } else {
> atomic_clearbits_int(&q->p_flag, P_WSLEEP);
> q->p_stat = SSLEEP;
> }
> Index: kern/kern_synch.c
> ===================================================================
> RCS file: /cvs/src/sys/kern/kern_synch.c,v
> retrieving revision 1.200
> diff -u -p -r1.200 kern_synch.c
> --- kern/kern_synch.c 13 Sep 2023 14:25:49 -0000 1.200
> +++ kern/kern_synch.c 22 Feb 2024 14:55:05 -0000
> @@ -467,25 +467,23 @@ sleep_signal_check(void)
> }
>
> int
> -wakeup_proc(struct proc *p, const volatile void *chan, int flags)
> +wakeup_proc(struct proc *p, int flags)
> {
> int awakened = 0;
>
> SCHED_ASSERT_LOCKED();
>
> - if (p->p_wchan != NULL &&
> - ((chan == NULL) || (p->p_wchan == chan))) {
> + if (p->p_wchan != NULL) {
> awakened = 1;
> if (flags)
> atomic_setbits_int(&p->p_flag, flags);
> - if (p->p_stat == SSLEEP)
> - setrunnable(p);
> - else if (p->p_stat == SSTOP)
> - unsleep(p);
> #ifdef DIAGNOSTIC
> - else
> - panic("wakeup: p_stat is %d", (int)p->p_stat);
> + if (p->p_stat != SSLEEP && p->p_stat != SSTOP)
> + panic("thread %d p_stat is %d", p->p_tid, p->p_stat);
> #endif
> + unsleep(p);
> + if (p->p_stat == SSLEEP)
> + setrunnable(p);
> }
>
> return awakened;
> @@ -505,7 +503,7 @@ endtsleep(void *arg)
> int s;
>
> SCHED_LOCK(s);
> - wakeup_proc(p, NULL, P_TIMEOUT);
> + wakeup_proc(p, P_TIMEOUT);
> SCHED_UNLOCK(s);
> }
>
> @@ -531,21 +529,35 @@ unsleep(struct proc *p)
> void
> wakeup_n(const volatile void *ident, int n)
> {
> - struct slpque *qp;
> + struct slpque *qp, wakeq;
> struct proc *p;
> struct proc *pnext;
> int s;
>
> + TAILQ_INIT(&wakeq);
> +
> SCHED_LOCK(s);
> qp = &slpque[LOOKUP(ident)];
> for (p = TAILQ_FIRST(qp); p != NULL && n != 0; p = pnext) {
> pnext = TAILQ_NEXT(p, p_runq);
> #ifdef DIAGNOSTIC
> if (p->p_stat != SSLEEP && p->p_stat != SSTOP)
> - panic("wakeup: p_stat is %d", (int)p->p_stat);
> + panic("thread %d p_stat is %d", p->p_tid, p->p_stat);
> #endif
> - if (wakeup_proc(p, ident, 0))
> + KASSERT(p->p_wchan != NULL);
> + if (p->p_wchan == ident) {
> + TAILQ_REMOVE(qp, p, p_runq);
> + p->p_wchan = NULL;
> + TAILQ_INSERT_TAIL(&wakeq, p, p_runq);
> --n;
> + }
> + }
> + while ((p = TAILQ_FIRST(&wakeq))) {
> + TAILQ_REMOVE(&wakeq, p, p_runq);
> + TRACEPOINT(sched, unsleep, p->p_tid + THREAD_PID_OFFSET,
> + p->p_p->ps_pid);
> + if (p->p_stat == SSLEEP)
> + setrunnable(p);
> }
> SCHED_UNLOCK(s);
> }
> Index: kern/sched_bsd.c
> ===================================================================
> RCS file: /cvs/src/sys/kern/sched_bsd.c,v
> retrieving revision 1.90
> diff -u -p -r1.90 sched_bsd.c
> --- kern/sched_bsd.c 24 Jan 2024 19:23:38 -0000 1.90
> +++ kern/sched_bsd.c 22 Feb 2024 14:24:31 -0000
> @@ -500,12 +500,10 @@ setrunnable(struct proc *p)
> if ((pr->ps_flags & PS_TRACED) != 0 && pr->ps_xsig != 0)
> atomic_setbits_int(&p->p_siglist, sigmask(pr->ps_xsig));
> prio = p->p_usrpri;
> - unsleep(p);
> setrunqueue(NULL, p, prio);
> break;
> case SSLEEP:
> prio = p->p_slppri;
> - unsleep(p); /* e.g. when sending signals */
>
> /* if not yet asleep, don't add to runqueue */
> if (ISSET(p->p_flag, P_WSLEEP))
> Index: kern/sys_process.c
> ===================================================================
> RCS file: /cvs/src/sys/kern/sys_process.c,v
> retrieving revision 1.95
> diff -u -p -r1.95 sys_process.c
> --- kern/sys_process.c 21 Nov 2023 14:00:13 -0000 1.95
> +++ kern/sys_process.c 22 Feb 2024 14:25:34 -0000
> @@ -493,6 +493,7 @@ ptrace_ctrl(struct proc *p, int req, pid
> if (t->p_stat == SSTOP) {
> tr->ps_xsig = data;
> SCHED_LOCK(s);
> + unsleep(t);
> setrunnable(t);
> SCHED_UNLOCK(s);
> } else {
> Index: sys/proc.h
> ===================================================================
> RCS file: /cvs/src/sys/sys/proc.h,v
> retrieving revision 1.356
> diff -u -p -r1.356 proc.h
> --- sys/proc.h 3 Feb 2024 18:51:58 -0000 1.356
> +++ sys/proc.h 22 Feb 2024 14:32:13 -0000
> @@ -555,7 +555,7 @@ void procinit(void);
> void setpriority(struct proc *, uint32_t, uint8_t);
> void setrunnable(struct proc *);
> void endtsleep(void *);
> -int wakeup_proc(struct proc *, const volatile void *, int);
> +int wakeup_proc(struct proc *, int);
> void unsleep(struct proc *);
> void reaper(void *);
> __dead void exit1(struct proc *, int, int, int);
--
:wq Claudio