On Mon, Dec 30, 2019 at 3:16 PM Dave Voutila <[email protected]> wrote: > > >Synopsis: Panic in kern_event.c using Rust async tcp sockets on > >multi-cpu system > >Category: kernel > >Environment: > System : OpenBSD 6.6 > Details : OpenBSD 6.6-current (GENERIC.MP) #575: Mon Dec 30 > 04:47:45 MST 2019 > [email protected]:/usr/src/sys/arch/amd64/compile/GENERIC.MP > > Architecture: OpenBSD.amd64 > Machine : amd64 > >Description: > I was experimenting with some Rust async-std code when I managed to > trigger a panic on my laptop. I reproduced the same panic in a 2 cpu > VirtualBox instance hosted on macOS and the issue seems to be (to me) > a race condition as I haven't been able to reliable reproduce it in a > single core machine. (I tried a -current guest in vmm on my Lenovo > x270 to no avail.) > > "fun" is the name of the offending rust program in the panic below, > which uses multiple threads to manage multiple TCP sockets. Basically > it's doing the following: > > 1. Spawning an event loop > 2. Spawning multiple async listeners on different TCP ports > 3. When a client connections, spawns another async handler for reading > data from the client connection. > 4. Terminating the listener if the client reads a shutdown message > ('die') signalling via an atomic boolean. > > There's a timeout being used (100ms) that I put in to allow for seeing > the boolean change so the listening socket can shutdown. > > Under the covers I believe the Rust async-std library is using kqueue > and setting sockets to non-blocking mode. > > From my serial console connection the panic and ddb output look like: > > ------ > panic: kernel diagnostic assertion "kn->kn_status & KN_PROCESSING" > failed: file "/usr/src/sys/kern/kern_event.c", line 1015 > Stopped at db_enter+0x10: popq %rbp > TID PID UID PRFLAGS PFLAGS CPU COMMAND > 465238 49168 1000 0x100003 0 1 nc > *490511 60196 1000 0x3 0x4000000 0K fun > db_enter() at db_enter+0x10 > panic(ffffffff81c5d8f2) at panic+0x128 > __assert(ffffffff81cb3133,ffffffff81c79b29,3f7,ffffffff81c6ea37) at > __assert+0x > 2b > kqueue_scan(fffffd81e112b2d8,3e8,ec9fdc9d000,0,ffff8000ffff0798,ffff8000221e61d > 8) at kqueue_scan+0x9ff > sys_kevent(ffff8000ffff0798,ffff8000221e6240,ffff8000221e62a0) at > sys_kevent+0x > 2a9 > syscall(ffff8000221e6310) at syscall+0x389 > Xsyscall() at Xsyscall+0x128 > end of kernel > end trace frame: 0xeca2e42da40, count: 8 > https://www.openbsd.org/ddb.html describes the minimum info required in bug > reports. Insufficient info makes it difficult to find and fix bugs. > ddb{0}> trace > trace > db_enter() at db_enter+0x10 > panic(ffffffff81c5d8f2) at panic+0x128 > __assert(ffffffff81cb3133,ffffffff81c79b29,3f7,ffffffff81c6ea37) at > __assert+0x > 2b > kqueue_scan(fffffd81e112b2d8,3e8,ec9fdc9d000,0,ffff8000ffff0798,ffff8000221e61d > 8) at kqueue_scan+0x9ff > sys_kevent(ffff8000ffff0798,ffff8000221e6240,ffff8000221e62a0) at > sys_kevent+0x > 2a9 > syscall(ffff8000221e6310) at syscall+0x389 > Xsyscall() at Xsyscall+0x128 > end of kernel > end trace frame: 0xeca2e42da40, count: -7 > ddb{0}> > db_enter() at db_enter+0x10 > end trace frame: 0xffff8000221e5da0, count: 0 > ddb{0}> machine ddbcpu 0 > machine ddbcpu 0 > Invalid cpu 0 > ddb{0}> > CPU not specified > ddb{0}> machine ddbcpu 1 > machine ddbcpu 1 > Stopped at x86_ipi_db+0x12: leave > x86_ipi_db(ffff800022010ff0) at x86_ipi_db+0x12 > x86_ipi_handler() at x86_ipi_handler+0x80 > Xresume_lapic_ipi() at Xresume_lapic_ipi+0x23 > _kernel_lock() at _kernel_lock+0xa9 > Xsyscall() at Xsyscall+0x128 > end of kernel > end trace frame: 0x7f7ffffe81f0, count: 10 > ddb{1}> > CPU not specified > ddb{1}> machine ddbcpu 2 > machine ddbcpu 2 > Invalid cpu 2 > ddb{1}> > CPU not specified > ddb{1}> machine > machine > cpuinfo startcpu stopcpu ddbcpu acpi > ddb{1}> > CPU not specified > ddb{1}> machine cpuinfo > machine cpuinfo > 0: stopped > * 1: ddb > ddb{1}> > 0: stopped > * 1: ddb [snip the long ddb/ps output] > > >How-To-Repeat: > I only have a tedious manual reproduction at the moment, but it > involves running the below Rust 1.40 program and using a handful of nc > instances to hold open client connections while trying to kill the > listening socket. > > 1. Install the rust port in -current > 2. Clone https://github.com/voutilad/async-fun > 3. Run "cargo build --release" (it seems to occur only in release builds) > 4. Using something like tmux, start the program (./target/release/fun) > and then open multiple connections using netcat (nc localhost 9990). > 5. Quickly send a die command (echo die | nc localhost 9990) and then > switch to another nc instance and hit enter/return > 6. If (un)lucky, you'll induce a kernel panic > > I've been able to reproduce it both on a Lenovo x270 as well as a > VirtualBox guest instance both running the most recent snapshots. (The > dmesg below is from the vbox guest on macOS which is a cleaner > install.) > > >Fix: > (I have no idea :-) > [snipping dmesg]
FWIW, I haven't been able to reproduce the panic from kern_event.c using newer snapshots. This may have been related to work visa@ has been doing recently. https://cvsweb.openbsd.org/cgi-bin/cvsweb/src/sys/kern/kern_event.c.diff?r1=1.113&r2=1.112&f=h -Dave
