On Mon, Nov 01 2021, Martin Pieuchot <m...@openbsd.org> wrote:
> On 31/10/21(Sun) 15:57, Jeremie Courreges-Anglas wrote:
>> On Fri, Oct 08 2021, Jeremie Courreges-Anglas <j...@wxcvbn.org> wrote:
>> > riscv64.ports was running dpb(1) with two other members in the build
>> > cluster.  A few minutes ago I found it in ddb(4).  The report is short,
>> > sadly, as the machine doesn't return from the 'bt' command.
>> >
>> > The machine is acting both as an NFS server and an NFS client.
>> >
>> > OpenBSD/riscv64 (riscv64.ports.openbsd.org) (console)
>> >
>> > login: panic: pool_anic:t: pol_ free l: p mod fiee liat m  oxifief:c a2e 
>> > 07ff0ff fte21ade0 00f ifem c0d
>> > 1 07f1f0ffcf2177 010=0 c16ce6 7x090xc52c !
>> > 0x9066d21 919 xc1521
>> > Stopped at      panic+0xfe:     addi    a0,zero,256
>> >     TID    PID    UID     PRFLAGS     PFLAGS  CPU  COMMAND
>> >   24243  43192     55         0x2          0    0  cc
>> > *480349  52543      0        0x11          0    1  perl
>> >  480803  72746     55         0x2          0    3  c++
>> >  366351   3003     55         0x2          0    2K c++
>> > panic() at panic+0xfa
>> > panic() at pool_do_get+0x29a
>> > pool_do_get() at pool_get+0x76
>> > pool_get() at pmap_enter+0x128
>> > pmap_enter() at uvm_fault_upper+0x1c2
>> > uvm_fault_upper() at uvm_fault+0xb2
>> > uvm_fault() at do_trap_user+0x120
>> > https://www.openbsd.org/ddb.html describes the minimum info required in bug
>> > reports.  Insufficient info makes it difficult to find and fix bugs.
>> > ddb{1}> bt
>> > panic() at panic+0xfa
>> > panic() at pool_do_get+0x29a
>> > pool_do_get() at pool_get+0x76
>> > pool_get() at pmap_enter+0x128
>> > pmap_enter() at uvm_fault_upper+0x1c2
>> > uvm_fault_upper() at uvm_fault+0xb2
>> > uvm_fault() at do_trap_user+0x120
>> > do_trap_user() at cpu_exception_handler_user+0x7a
>> > <hangs>
>> 
>> Another panic on riscv64-1, a new board which doesn't have RTC/I2C
>> problems anymore and is acting as a dpb(1) cluster member/NFS client.
>
> Why are both traces ending in pool_do_get()?  Are CPU0 and CPU1 there at
> the same time?
>
> This corruption as well as the one above arise in the top part of the
> fault handler which already runs concurrently.  Did you try putting
> KERNEL_LOCK/UNLOCK() dances around uvm_fault() in trap.c?  That could
> help figure out if something is still unsafe in riscv64's pmap.

On my riscv64 I did add locking around the two uvm_fault() calls as
suggested, rebooted, then started building libcrypto and libssl and left
the place.  Sadly the box is now unreachable (panic?) and will stay as
is for the next few days.  I'll get back to it on Sunday.
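
For reference, the shape of that change is roughly the following.  This
is only a user-space sketch, not the actual trap.c diff: KERNEL_LOCK()
and KERNEL_UNLOCK() are modelled with a pthread mutex, and
uvm_fault_stub() and handle_fault_locked() are hypothetical names
standing in for uvm_fault() and its call site.

```c
/*
 * User-space sketch only.  The real KERNEL_LOCK()/KERNEL_UNLOCK() and
 * uvm_fault() live in the kernel; everything here is a stand-in.
 */
#include <assert.h>
#include <pthread.h>

static pthread_mutex_t big_lock = PTHREAD_MUTEX_INITIALIZER;
static int big_lock_depth;	/* only touched while big_lock is held */

#define KERNEL_LOCK() do {					\
	pthread_mutex_lock(&big_lock);				\
	big_lock_depth++;					\
} while (0)

#define KERNEL_UNLOCK() do {					\
	assert(big_lock_depth > 0);	/* must hold the lock */	\
	big_lock_depth--;					\
	pthread_mutex_unlock(&big_lock);			\
} while (0)

/* Hypothetical stand-in: returns 0 only if called with the lock held. */
static int
uvm_fault_stub(unsigned long va)
{
	(void)va;
	return (big_lock_depth == 1) ? 0 : -1;
}

/* The suggested pattern: serialize the whole fault path while testing. */
int
handle_fault_locked(unsigned long va)
{
	int error;

	KERNEL_LOCK();
	error = uvm_fault_stub(va);
	KERNEL_UNLOCK();
	return error;
}
```

If the corruption stops once every fault is serialized like this, that
points at the unlocked path; if it persists, the problem is probably
elsewhere.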

Since I haven't mentioned it in this thread: clang often crashes with
SIGSEGV when building ports.  For the first two published bulk builds
I just restarted the failed ports.

-- 
jca | PGP : 0x1524E7EE / 5135 92C1 AD36 5293 2BDF  DDCC 0DFA 74AE 1524 E7EE
