On Thu, Nov 13, 2025 at 12:18:29PM +0000, Martin Pieuchot wrote:
> On 05/11/25(Wed) 15:48, Dave Voutila wrote:
> > "Piotr K. Isajew" <[email protected]> writes:
> >
> > > I wanted to install local browser extension (Extensions -> Manage
> > > extensions), but the browser process crashed. So I relaunched the
> > > browser from terminal and when I choose that option from the
> > > menu, entire OS has crashed.
> > >
> > > Attaching dmesg and screenshot showing crash.
> >
> > My quick and dirty OCR via the magic of "AI":
> >
> > uvm_fault(0xfffffd814aa2ba38, 0x60, 0, 1) -> e
> > kernel: page fault trap, code=0
> > Stopped at bread+0x33: testq $0x180,0x60(%rax)
> > TID PID UID PRFLAGS PFLAGS CPU COMMAND
> > 158743 32412 35 0x12 0 0 Xorg
> > bread(0xfffffd8125e6e8,e0,4000,fff80003ed0a1a0) at bread+0x33
> > ffs_update(fffffd80668f4700,1) at ffs_update+0xff
> > ffs_truncate(fffffd80668f4700,0,0,ffffffffffffffff) at ffs_truncate+0x615
> > ufs_inactive(fffffd8063eed1c8) at ufs_inactive+0xca
> > UOP_INACTIVE(fffffd8063eed1c8,ffff80003eb60fd8) at UOP_INACTIVE+0x4b
> > vput(fffffd8122232c0) at vput+0x60
> > vn_closefile(fffffd8125eb609,ffff80003eb60fd8) at vn_closefile+0xc3
> > fdrop(fffffd8125eb609,ffff80003eb60fd8) at fdropt+0xa1
> > closef(fffffd8125eb609,ffff80003eb60fd8) at closef+0xba
> > syscall() at syscall+0x5f9
> > syscall() at syscall+0x128
> > end trace frame: 0x7e81f564b4b, count: 4
> >
> > Looks like getblk() in vfs_bio returned NULL... corrupted filesystem
> > maybe?
>
> I don't think so. getblk(0, INFSLP) should never return NULL.
It shouldn't indeed, but it does for me.
> Piotr can you easily reproduce this bug?
Since at least Aug 30 I'd get hangs when watching youtube videos in
firefox, with variable frequency and no visual feedback. I set up AMT
console redirection to debug that, with no result. On one occasion
without AMT I got to see a partial trace similar to the above but
missed the "kernel: page fault trap" part that doesn't end up in "show
panic".
After Piotr's report I decided to test yt videos again with a -current
kernel, and very quickly the system hanged. After adding guards and a
printf in getblk() - see diff below - I could see that tsleep_nsec()
was actually returning -1 (ERESTART). Traces also show biowait()
surviving the same error. More instrumentation showed that P_SINTR
can be set at tsleep_nsec() *entry*, so even if sleep_setup() didn't
set the flag, sleep_finish() can pick it up and return ERESTART:
tsleep_nsec: returning ERESTART without PCATCH (p_sintr 128)
sleep_finish() always clears P_SINTR, so clearly that would point to a
sleep_setup(PCATCH) call with no corresponding sleep_finish(). All of
the call sites for these functions are tightly bound, except the ones
in dev/pci/drm. set_current_state() does use sleep_setup(), but
__set_current_state() partly unrolls sleep_finish() and doesn't clear
P_SINTR. I assume there were good reasons to unroll it.
Clearing P_SINTR in __set_current_state() gives me no more traces
after a few hours of testing. Clearing P_SINTR at the start of
sleep_setup() looks like sweeping the problem under the rug. Maybe we
should warn if P_SINTR is set at the start of sleep_setup()?
Here's a dump of the diag code and the tentative fix, drop the latter
if you want to get some traces.
cc'ing the usual drm and sleep/signal suspects.
Index: dev/pci/drm/drm_linux.c
===================================================================
RCS file: /cvs/src/sys/dev/pci/drm/drm_linux.c,v
diff -u -p -r1.126 drm_linux.c
--- dev/pci/drm/drm_linux.c 13 Jun 2025 07:01:37 -0000 1.126
+++ dev/pci/drm/drm_linux.c 15 Nov 2025 13:27:22 -0000
@@ -126,7 +126,7 @@ __set_current_state(int state)
SCHED_LOCK();
unsleep(p);
p->p_stat = SONPROC;
- atomic_clearbits_int(&p->p_flag, P_INSCHED);
+ atomic_clearbits_int(&p->p_flag, P_INSCHED|P_SINTR);
SCHED_UNLOCK();
}
Index: kern/kern_synch.c
===================================================================
RCS file: /cvs/src/sys/kern/kern_synch.c,v
diff -u -p -r1.232 kern_synch.c
--- kern/kern_synch.c 18 Aug 2025 04:15:35 -0000 1.232
+++ kern/kern_synch.c 15 Nov 2025 12:03:40 -0000
@@ -114,6 +114,8 @@ int
tsleep_nsec(const volatile void *ident, int priority, const char *wmesg,
uint64_t nsecs)
{
+ int error;
+ int p_sintr = atomic_load_int(&curproc->p_flag) & P_SINTR;
#ifdef MULTIPROCESSOR
int hold_count;
#endif
@@ -150,7 +152,15 @@ tsleep_nsec(const volatile void *ident,
}
sleep_setup(ident, priority, wmesg);
- return sleep_finish(nsecs, 1);
+ error = sleep_finish(nsecs, 1);
+
+ if (error == ERESTART && !ISSET(priority, PCATCH)) {
+ printf("%s: returning ERESTART without PCATCH (p_sintr %d)\n",
+ __func__, p_sintr);
+ db_stack_dump();
+ }
+
+ return error;
}
int
@@ -176,6 +186,7 @@ msleep_nsec(const volatile void *ident,
const char *wmesg, uint64_t nsecs)
{
int error, spl;
+ int p_sintr = atomic_load_int(&curproc->p_flag) & P_SINTR;
#ifdef MULTIPROCESSOR
int hold_count;
#endif
@@ -221,6 +232,12 @@ msleep_nsec(const volatile void *ident,
if ((priority & PNORELOCK) == 0)
mtx_enter(mtx);
+ if (error == ERESTART && !ISSET(priority, PCATCH)) {
+ printf("%s: returning ERESTART without PCATCH (p_sintr %d)\n",
+ __func__, p_sintr);
+ db_stack_dump();
+ }
+
return error;
}
@@ -247,6 +264,7 @@ rwsleep_nsec(const volatile void *ident,
const char *wmesg, uint64_t nsecs)
{
int error, status;
+ int p_sintr = atomic_load_int(&curproc->p_flag) & P_SINTR;
KASSERT((priority & ~(PRIMASK | PCATCH | PNORELOCK)) == 0);
KASSERT(ident != &nowake || ISSET(priority, PCATCH) || nsecs != INFSLP);
@@ -262,6 +280,12 @@ rwsleep_nsec(const volatile void *ident,
if ((priority & PNORELOCK) == 0)
rw_enter(rwl, status);
+
+ if (error == ERESTART && !ISSET(priority, PCATCH)) {
+ printf("%s: returning ERESTART without PCATCH (p_sintr %d)\n",
+ __func__, p_sintr);
+ db_stack_dump();
+ }
return error;
}
Index: kern/vfs_bio.c
===================================================================
RCS file: /cvs/src/sys/kern/vfs_bio.c,v
diff -u -p -r1.216 vfs_bio.c
--- kern/vfs_bio.c 8 Jun 2025 06:23:43 -0000 1.216
+++ kern/vfs_bio.c 14 Nov 2025 13:11:02 -0000
@@ -1013,8 +1013,14 @@ start:
error = tsleep_nsec(bp, slpflag | (PRIBIO + 1),
"getblk", slptimeo);
splx(s);
- if (error)
+ if (error) {
+ if (slpflag == 0 && slptimeo == INFSLP) {
+ printf("%s: tsleep error %d\n",
+ __func__, error);
+ goto start;
+ }
return (NULL);
+ }
goto start;
}
--
jca