Hi there,

There seems to be some racy code in kern_event.c that is causing crashes for me. I’ve attached the test program used to reproduce them (build it and run the “go” script). The crashes were produced in a VM with 4 cores running FreeBSD 11.0-ALPHA3 (and originally 10.3). The crashes I’ve seen come in a few varieties:

1. “userret: returning with the following locks held”. This one is the easiest to hit (assuming witness is enabled).

userret: returning with the following locks held:
exclusive sleep mutex process lock (process lock) r = 0 (0xfffff80006956120) locked @ /usr/src/sys/kern/kern_event.c:2125
panic: witness_warn
cpuid = 2
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe000039d8e0
vpanic() at vpanic+0x182/frame 0xfffffe000039d960
kassert_panic() at kassert_panic+0x126/frame 0xfffffe000039d9d0
witness_warn() at witness_warn+0x3c6/frame 0xfffffe000039daa0
userret() at userret+0x9d/frame 0xfffffe000039dae0
amd64_syscall() at amd64_syscall+0x406/frame 0xfffffe000039dbf0
Xfast_syscall() at Xfast_syscall+0xfb/frame 0xfffffe000039dbf0
--- syscall (1, FreeBSD ELF64, sys_sys_exit), rip = 0x800b8a0ba, rsp = 0x7fffffffea98, rbp = 0x7fffffffeae0 ---
KDB: enter: panic
[ thread pid 64855 tid 100106 ]
Stopped at      kdb_enter+0x3b: movq    $0,kdb_why
db> show all locks
Process 64855 (watch) thread 0xfffff800066c3000 (100106)
exclusive sleep mutex process lock (process lock) r = 0 (0xfffff80006956120) locked @ /usr/src/sys/kern/kern_event.c:2125
Process 64855 (watch) thread 0xfffff8000696a500 (100244)
exclusive sleep mutex pmap (pmap) r = 0 (0xfffff800068c3138) locked @ /usr/src/sys/amd64/amd64/pmap.c:4067
exclusive sx vm map (user) (vm map (user)) r = 0 (0xfffff800068f6080) locked @ /usr/src/sys/vm/vm_map.c:3315
exclusive sx vm map (user) (vm map (user)) r = 0 (0xfffff800068c3080) locked @ /usr/src/sys/vm/vm_map.c:3311
db> ps
  pid  ppid  pgrp   uid   state   wmesg         wchan        cmd
64855   690   690     0  R+      (threaded)                  watch
100106                   Run     CPU 2                       main
100244                   Run     CPU 1                       procmaker
100245                   Run     CPU 3                       reaper

2. “Sleeping thread owns a non-sleepable lock”. This one first drew my attention by showing up in a real world application at work.

Sleeping thread (tid 100101, pid 76857) owns a non-sleepable lock
KDB: stack backtrace of thread 100101:
sched_switch() at sched_switch+0x2a5/frame 0xfffffe0000257690
mi_switch() at mi_switch+0xe1/frame 0xfffffe00002576d0
sleepq_catch_signals() at sleepq_catch_signals+0x16c/frame 0xfffffe0000257730
sleepq_timedwait_sig() at sleepq_timedwait_sig+0xf/frame 0xfffffe0000257760
_sleep() at _sleep+0x234/frame 0xfffffe00002577e0
kern_kevent_fp() at kern_kevent_fp+0x38a/frame 0xfffffe00002579d0
kern_kevent() at kern_kevent+0x9f/frame 0xfffffe0000257a30
sys_kevent() at sys_kevent+0x12a/frame 0xfffffe0000257ae0
amd64_syscall() at amd64_syscall+0x2d4/frame 0xfffffe0000257bf0
Xfast_syscall() at Xfast_syscall+0xfb/frame 0xfffffe0000257bf0
--- syscall (363, FreeBSD ELF64, sys_kevent), rip = 0x800b6afea, rsp = 0x7fffffffea88, rbp = 0x7fffffffead0 ---
panic: sleeping thread
cpuid = 3
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe0000225590
kdb_backtrace() at kdb_backtrace+0x39/frame 0xfffffe0000225640
vpanic() at vpanic+0x126/frame 0xfffffe0000225680
panic() at panic+0x43/frame 0xfffffe00002256e0
propagate_priority() at propagate_priority+0x166/frame 0xfffffe0000225710
turnstile_wait() at turnstile_wait+0x282/frame 0xfffffe0000225750
__mtx_lock_sleep() at __mtx_lock_sleep+0x26b/frame 0xfffffe00002257d0
__mtx_lock_flags() at __mtx_lock_flags+0x5e/frame 0xfffffe00002257f0
proc_to_reap() at proc_to_reap+0x46/frame 0xfffffe0000225840
kern_wait6() at kern_wait6+0x202/frame 0xfffffe00002258f0
sys_wait4() at sys_wait4+0x72/frame 0xfffffe0000225ae0
amd64_syscall() at amd64_syscall+0x2d4/frame 0xfffffe0000225bf0
Xfast_syscall() at Xfast_syscall+0xfb/frame 0xfffffe0000225bf0
--- syscall (7, FreeBSD ELF64, sys_wait4), rip = 0x800b209ba, rsp = 0x7fffdfdfcf48, rbp = 0x7fffdfdfcf80 ---
KDB: enter: panic
[ thread pid 76857 tid 100225 ]
Stopped at      kdb_enter+0x3e: movq    $0,kdb_why
db> show allchains
chain 1:
thread 100225 (pid 76857, reaper) blocked on lock 0xfffff800413105f0 (sleep mutex) "process lock"
 thread 100101 (pid 76857, main) inhibited

3. There are a few others that I hit less frequently (“page fault while in kernel mode”, “Kernel page fault with the following non-sleepable locks held”). I don’t have a backtrace handy for these.

I believe they all have more or less the same cause. We acquire a knlist lock via the KN_LIST_LOCK macro, but by the time we call KN_LIST_UNLOCK, the knote’s knlist reference (kn->kn_knlist) has been cleared by another thread. We are therefore unable to unlock the previously acquired lock, and we hold it until something causes us to crash (such as the witness code noticing that we’re returning to userland with the lock still held).

A walkthrough of what happens in the test program:

There are 3 threads: thread 1 forks off short-lived child processes, thread 2 reaps them, and thread 3 tracks them via a kqueue (NOTE_EXIT | NOTE_EXEC | NOTE_FORK | NOTE_TRACK). I believe a crash generally looks like this:

1. The forker thread creates a short-lived child. That child dies and triggers a NOTE_EXIT event.

2. The kqueue thread is somewhere in kqueue_scan(), probably blocked at a KN_LIST_LOCK call.

3. The dying process calls into filt_proc() and notices that the KN_DETACHED flag is not set. It therefore calls knlist_remove_inevent() to take the knote out of the knlist. Importantly, this sets kn->kn_knlist to NULL, meaning we can no longer reach the knlist lock from the knote.

4. The kqueue thread, still in kqueue_scan(), acquires the lock via KN_LIST_LOCK. It does some work and then calls the KN_LIST_UNLOCK macro. This macro finds that the knote no longer has a reference to a knlist, and thus takes no action, leaving the lock in the locked state.

I believe there’s also a small window where the KN_LIST_LOCK macro checks kn->kn_knlist and finds it to be non-NULL, but by the time it actually dereferences it, it has become NULL. This would produce the “page fault while in kernel mode” crash.

If someone familiar with this code sees an obvious fix, I’ll be happy to test it. Otherwise, I’d appreciate any advice on fixing this. My first thought is that a ‘struct knote’ ought to have its own mutex for controlling access to the flag fields and ideally the “kn_knlist” field. I.e., you would first acquire a knote’s lock and then the knlist lock, thus ensuring that no one could clear the kn_knlist variable while you hold the knlist lock. The knlist lock, however, usually comes from whichever event producing entity the knote tracks, so getting lock ordering right between the per-knote mutex and this other lock seems potentially hard. (Sometimes we call into functions in kern_event.c with the knlist lock already held, having been acquired in code outside of kern_event.c. Consider, for example, calling KNOTE_LOCKED from kern_exit.c; the PROC_LOCK macro has already been used to acquire the process lock, also serving as the knlist lock).

Apropos of the knlist lock and its provenance: why is a lock from the event producing entity used to control access to the knlist and knote? Is it generally desirable to, for example, hold the process lock while operating on a knlist attached to that process? It’s not obvious to me that this is required or even desirable. This might suggest that a knlist should have its own lock rather than using a lock from the event producing entity, which might make addressing this problem more straightforward.

Many thanks for any help, and please let me know if I’ve failed to make anything clear.


freebsd-current@freebsd.org mailing list
