On Mon, Nov 01 2021, Martin Pieuchot <m...@openbsd.org> wrote:
> On 31/10/21(Sun) 15:57, Jeremie Courreges-Anglas wrote:
>> On Fri, Oct 08 2021, Jeremie Courreges-Anglas <j...@wxcvbn.org> wrote:
>> > riscv64.ports was running dpb(1) with two other members in the build
>> > cluster.  A few minutes ago I found it in ddb(4).  The report is short,
>> > sadly, as the machine doesn't return from the 'bt' command.
>> >
>> > The machine is acting both as an NFS server and an NFS client.
>> >
>> > OpenBSD/riscv64 (riscv64.ports.openbsd.org) (console)
>> >
>> > login: panic: pool_anic:t: pol_ free l: p mod fiee liat m oxifief:c a2e
>> > 07ff0ff fte21ade0 00f ifem c0d
>> > 1 07f1f0ffcf2177 010=0 c16ce6 7x090xc52c !
>> > 0x9066d21 919 xc1521
>> > Stopped at      panic+0xfe:     addi    a0,zero,256
>> >     TID    PID    UID     PRFLAGS     PFLAGS  CPU  COMMAND
>> >   24243  43192     55         0x2          0    0  cc
>> > *480349  52543      0        0x11          0    1  perl
>> >  480803  72746     55         0x2          0    3  c++
>> >  366351   3003     55         0x2          0    2K c++
>> > panic() at panic+0xfa
>> > panic() at pool_do_get+0x29a
>> > pool_do_get() at pool_get+0x76
>> > pool_get() at pmap_enter+0x128
>> > pmap_enter() at uvm_fault_upper+0x1c2
>> > uvm_fault_upper() at uvm_fault+0xb2
>> > uvm_fault() at do_trap_user+0x120
>> > https://www.openbsd.org/ddb.html describes the minimum info required in bug
>> > reports.  Insufficient info makes it difficult to find and fix bugs.
>> > ddb{1}> bt
>> > panic() at panic+0xfa
>> > panic() at pool_do_get+0x29a
>> > pool_do_get() at pool_get+0x76
>> > pool_get() at pmap_enter+0x128
>> > pmap_enter() at uvm_fault_upper+0x1c2
>> > uvm_fault_upper() at uvm_fault+0xb2
>> > uvm_fault() at do_trap_user+0x120
>> > do_trap_user() at cpu_exception_handler_user+0x7a
>> > <hangs>
>>
>> Another panic on riscv64-1, a new board which doesn't have RTC/I2C
>> problems anymore and is acting as a dpb(1) cluster member/NFS client.
>
> Why are both traces ending in pool_do_get()?  Are CPU0 and CPU1 there at
> the same time?
>
> This corruption as well as the one above arise in the top part of the
> fault handler which already runs concurrently.  Did you try putting
> KERNEL_LOCK/UNLOCK() dances around uvm_fault() in trap.c?  That could
> help figure out if something is still unsafe in riscv64's pmap.

On my riscv64 I added locking around the two uvm_fault() calls as
suggested, rebooted, started building libcrypto and libssl, then left.
Sadly the box is now unreachable (panic?) and will stay that way for the
next few days.  I'll get back to it on Sunday.

Since I haven't mentioned it in this thread yet: clang often crashes with
SIGSEGV when building ports.  For the first two published bulk builds
I just restarted the failed ports.

-- 
jca | PGP : 0x1524E7EE / 5135 92C1 AD36 5293 2BDF  DDCC 0DFA 74AE 1524 E7EE