On Mon, Nov 01 2021, Jeremie Courreges-Anglas <j...@wxcvbn.org> wrote:
> On Mon, Nov 01 2021, Jeremie Courreges-Anglas <j...@wxcvbn.org> wrote:
>> On Mon, Nov 01 2021, Martin Pieuchot <m...@openbsd.org> wrote:
>>> On 31/10/21(Sun) 15:57, Jeremie Courreges-Anglas wrote:
>>>> On Fri, Oct 08 2021, Jeremie Courreges-Anglas <j...@wxcvbn.org> wrote:
>>>> > riscv64.ports was running dpb(1) with two other members in the build
>>>> > cluster.  A few minutes ago I found it in ddb(4).  The report is short,
>>>> > sadly, as the machine doesn't return from the 'bt' command.
>>>> >
>>>> > The machine is acting both as an NFS server and and NFS client.
>>>> >
>>>> > OpenBSD/riscv64 (riscv64.ports.openbsd.org) (console)
>>>> >
>>>> > login: panic: pool_anic:t: pol_ free l: p mod fiee liat m  oxifief:c a2e 
>>>> > 07ff0ff fte21ade0 00f ifem c0d
>>>> > 1 07f1f0ffcf2177 010=0 c16ce6 7x090xc52c !
>>>> > 0x9066d21 919 xc1521
>>>> > Stopped at      panic+0xfe:     addi    a0,zero,256
>>>> >     TID    PID    UID     PRFLAGS     PFLAGS  CPU  COMMAND
>>>> >   24243  43192     55         0x2          0    0  cc
>>>> > *480349  52543      0        0x11          0    1  perl
>>>> >  480803  72746     55         0x2          0    3  c++
>>>> >  366351   3003     55         0x2          0    2K c++
>>>> > panic() at panic+0xfa
>>>> > panic() at pool_do_get+0x29a
>>>> > pool_do_get() at pool_get+0x76
>>>> > pool_get() at pmap_enter+0x128
>>>> > pmap_enter() at uvm_fault_upper+0x1c2
>>>> > uvm_fault_upper() at uvm_fault+0xb2
>>>> > uvm_fault() at do_trap_user+0x120
>>>> > https://www.openbsd.org/ddb.html describes the minimum info required
>>>> > in bug reports.  Insufficient info makes it difficult to find and fix
>>>> > bugs.
>>>> > ddb{1}> bt
>>>> > panic() at panic+0xfa
>>>> > panic() at pool_do_get+0x29a
>>>> > pool_do_get() at pool_get+0x76
>>>> > pool_get() at pmap_enter+0x128
>>>> > pmap_enter() at uvm_fault_upper+0x1c2
>>>> > uvm_fault_upper() at uvm_fault+0xb2
>>>> > uvm_fault() at do_trap_user+0x120
>>>> > do_trap_user() at cpu_exception_handler_user+0x7a
>>>> > <hangs>
>>>> 
>>>> Another panic on riscv64-1, a new board which doesn't have RTC/I2C
>>>> problems anymore and is acting as a dpb(1) cluster member/NFS client.
>>>
>>> Why are both traces ending in pool_do_get()?  Are CPU0 and CPU1 there at
>>> the same time?
>>>
>>> This corruption, as well as the one above, arises in the top part of the
>>> fault handler, which already runs concurrently.  Did you try putting
>>> KERNEL_LOCK/UNLOCK() dances around uvm_fault() in trap.c?  That could
>>> help figure out if something is still unsafe in riscv64's pmap.
>
> I'll try that on the ports bulk build machines.  After all, that's where
> I hit most/all the panics and clang crashes.
>
>> On my riscv64 I did add locking around the two uvm_fault() calls as
>> suggested, rebooted, then started building libcrypto and libssl and left
>> the place.  Sadly the box is now unreachable (panic?) and will stay as
> is for the next few days.  I'll get back to it on Sunday.
>
> That was a bit premature, I finally managed to remotely connect to the
> machine.  No idea why I couldn't connect to it for so long.
[...]

In the end I did run a kernel with KERNEL_LOCK/UNLOCK() added around the
uvm_fault() calls in trap.c, on all riscv64*.p machines.  The result was
a crash of both riscv64-1.p and riscv64.p in less than two hours,
something I had never seen before.
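For reference, the change I tested looks roughly like this.  This is only
a sketch against sys/arch/riscv64/riscv64/trap.c: the function and
variable names (udata_abort, map, va, access_type) are from memory and
the exact context around the uvm_fault() calls may differ.

```diff
--- sys/arch/riscv64/riscv64/trap.c
+++ sys/arch/riscv64/riscv64/trap.c
@@ udata_abort @@
-	error = uvm_fault(map, va, 0, access_type);
+	/* Serialize the upper fault path to test whether the
+	 * riscv64 pmap is the MP-unsafe party. */
+	KERNEL_LOCK();
+	error = uvm_fault(map, va, 0, access_type);
+	KERNEL_UNLOCK();
```

The same dance goes around the uvm_fault() call in the kernel-mode data
abort path.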

So while kernel-locking uvm_fault() again didn't fix the crashes, maybe
it pushed uvm into crashing more consistently?  One way to know would be
to run more experiments, but I can't reboot those machines at will... :-/

-- 
jca | PGP : 0x1524E7EE / 5135 92C1 AD36 5293 2BDF  DDCC 0DFA 74AE 1524 E7EE
