On Mon, Dec 30, 2019 at 3:16 PM Dave Voutila <[email protected]> wrote:
>
> >Synopsis:      Panic in kern_event.c using Rust async tcp sockets on 
> >multi-cpu system
> >Category:      kernel
> >Environment:
> System      : OpenBSD 6.6
> Details     : OpenBSD 6.6-current (GENERIC.MP) #575: Mon Dec 30
> 04:47:45 MST 2019
> [email protected]:/usr/src/sys/arch/amd64/compile/GENERIC.MP
>
> Architecture: OpenBSD.amd64
> Machine     : amd64
> >Description:
> I was experimenting with some Rust async-std code when I managed to
> trigger a panic on my laptop. I reproduced the same panic in a 2 cpu
> VirtualBox instance hosted on macOS and the issue seems to be (to me)
> a race condition as I haven't been able to reliable reproduce it in a
> single core machine. (I tried a -current guest in vmm on my Lenovo
> x270 to no avail.)
>
> "fun" is the name of the offending rust program in the panic below,
> which uses multiple threads to manage multiple TCP sockets. Basically
> it's doing the following:
>
> 1. Spawning an event loop
> 2. Spawning multiple async listeners on different TCP ports
> 3. When a client connections, spawns another async handler for reading
> data from the client connection.
> 4. Terminating the listener if the client reads a shutdown message
> ('die') signalling via an atomic boolean.
>
> There's a timeout being used (100ms) that I put in to allow for seeing
> the boolean change so the listening socket can shutdown.
>
> Under the covers I believe the Rust async-std library is using kqueue
> and setting sockets to non-blocking mode.
>
> From my serial console connection the panic and ddb output look like:
>
> ------
> panic: kernel diagnostic assertion "kn->kn_status & KN_PROCESSING"
> failed: file "/usr/src/sys/kern/kern_event.c", line 1015
> Stopped at      db_enter+0x10:  popq    %rbp
>     TID    PID    UID     PRFLAGS     PFLAGS  CPU  COMMAND
>  465238  49168   1000    0x100003          0    1  nc
> *490511  60196   1000         0x3  0x4000000    0K fun
> db_enter() at db_enter+0x10
> panic(ffffffff81c5d8f2) at panic+0x128
> __assert(ffffffff81cb3133,ffffffff81c79b29,3f7,ffffffff81c6ea37) at 
> __assert+0x
> 2b
> kqueue_scan(fffffd81e112b2d8,3e8,ec9fdc9d000,0,ffff8000ffff0798,ffff8000221e61d
> 8) at kqueue_scan+0x9ff
> sys_kevent(ffff8000ffff0798,ffff8000221e6240,ffff8000221e62a0) at 
> sys_kevent+0x
> 2a9
> syscall(ffff8000221e6310) at syscall+0x389
> Xsyscall() at Xsyscall+0x128
> end of kernel
> end trace frame: 0xeca2e42da40, count: 8
> https://www.openbsd.org/ddb.html describes the minimum info required in bug
> reports.  Insufficient info makes it difficult to find and fix bugs.
> ddb{0}> trace
> trace
> db_enter() at db_enter+0x10
> panic(ffffffff81c5d8f2) at panic+0x128
> __assert(ffffffff81cb3133,ffffffff81c79b29,3f7,ffffffff81c6ea37) at 
> __assert+0x
> 2b
> kqueue_scan(fffffd81e112b2d8,3e8,ec9fdc9d000,0,ffff8000ffff0798,ffff8000221e61d
> 8) at kqueue_scan+0x9ff
> sys_kevent(ffff8000ffff0798,ffff8000221e6240,ffff8000221e62a0) at 
> sys_kevent+0x
> 2a9
> syscall(ffff8000221e6310) at syscall+0x389
> Xsyscall() at Xsyscall+0x128
> end of kernel
> end trace frame: 0xeca2e42da40, count: -7
> ddb{0}>
> db_enter() at db_enter+0x10
> end trace frame: 0xffff8000221e5da0, count: 0
> ddb{0}> machine ddbcpu 0
> machine ddbcpu 0
> Invalid cpu 0
> ddb{0}>
> CPU not specified
> ddb{0}> machine ddbcpu 1
> machine ddbcpu 1
> Stopped at      x86_ipi_db+0x12:        leave
> x86_ipi_db(ffff800022010ff0) at x86_ipi_db+0x12
> x86_ipi_handler() at x86_ipi_handler+0x80
> Xresume_lapic_ipi() at Xresume_lapic_ipi+0x23
> _kernel_lock() at _kernel_lock+0xa9
> Xsyscall() at Xsyscall+0x128
> end of kernel
> end trace frame: 0x7f7ffffe81f0, count: 10
> ddb{1}>
> CPU not specified
> ddb{1}> machine ddbcpu 2
> machine ddbcpu 2
> Invalid cpu 2
> ddb{1}>
> CPU not specified
> ddb{1}> machine
> machine
> cpuinfo     startcpu    stopcpu     ddbcpu      acpi
> ddb{1}>
> CPU not specified
> ddb{1}> machine cpuinfo
> machine cpuinfo
>     0: stopped
> *   1: ddb
> ddb{1}>
>     0: stopped
> *   1: ddb
[snip the long ddb/ps output]
>
> >How-To-Repeat:
> I only have a tedious manual reproduction at the moment, but it
> involves running the below Rust 1.40 program and using a handful of nc
> instances to hold open client connections while trying to kill the
> listening socket.
>
> 1. Install the rust port in -current
> 2. Clone https://github.com/voutilad/async-fun
> 3. Run "cargo build --release" (it seems to occur only in release builds)
> 4. Using something like tmux, start the program (./target/release/fun)
> and then open multiple connections using netcat (nc localhost 9990).
> 5. Quickly send a die command (echo die | nc localhost 9990) and then
> switch to another nc instance and hit enter/return
> 6. If (un)lucky, you'll induce a kernel panic
>
> I've been able to reproduce it both on a Lenovo x270 as well as a
> VirtualBox guest instance both running the most recent snapshots. (The
> dmesg below is from the vbox guest on macOS which is a cleaner
> install.)
>
> >Fix:
> (I have no idea :-)
>
[snipping dmesg]

FWIW, I haven't been able to reproduce the panic from kern_event.c
using newer snapshots. This may have been related to work visa@ has
been doing recently.

https://cvsweb.openbsd.org/cgi-bin/cvsweb/src/sys/kern/kern_event.c.diff?r1=1.113&r2=1.112&f=h

-Dave

Reply via email to