Hi,

John Paul Adrian Glaubitz wrote:
>> Not nice. I started compiling some stuff and the box froze, I connected
>> serial console and could not resume due to Fast Data Access MMU miss"
> So, this crash occurs with the latest 5.15 kernel on your T2000?

exactly latest kernel.

I will retest it with stress-ng as soon as I finish this email and copy
the dmesg errors.

> In my experience, the most stable kernels on the older SPARCs are still the
> 4.19 kernels. Thus, we should start bisecting to find out what commit actually
> made the kernel unreliable on these older SPARCs.


We must find a good way to test though. I stress-tested the 5.9 kernel
further. The system sometimes seemed unresponsive, but eventually
recovered (some errors to know more pasted below). Thus I would consider
it "stable". I did run several small burst of tests and then a session
given of 30m minutes but that due to hiccups lasted more like 2 hours,
but afterwards, the machine was still up.

 sudo stress-ng --all 10 --timeout 30m

10 times means more than physical CPUs, but less than logical cores
(32). The system has 16GB of ram, I see some OOMs in dmesg, I wonder if
this is due to certain stress tests specifically going against any limit.

[16195.300448] Unable to handle kernel NULL pointer dereference in mna
handler
[16195.741725]  40014fef
[16195.741793]  at virtual address 00000000000000e7
[16195.767936]  b416801c
[16195.767945]  c2592468
[16195.767990] current->{active_,}mm->context = 0000000000000bb8
[16195.768848]  b8100008
[16195.768857]  920126c8
[16195.769673] current->{active_,}mm->pgd = ffff800089cac000

[16195.770413]               \|/ ____ \|/
                             "@'/ .. \`@"
                             /_| \__/ |_\
                                \__U_/
[16196.303333] systemd-journald[219777]: /dev/kmsg buffer overrun, some
messages lost.
[16196.304235] stress-ng(234874): Oops [#864]
[16196.304262] CPU: 8 PID: 234874 Comm: stress-ng Tainted: G      D
E  X  5.9.0-5-sparc64-smp #1 Debian 5.9.15-1
[16196.304281] TSTATE: 0000008811001605 TPC: 000000000042d8e0 TNPC:
000000000042d8e4 Y: 00000000    Tainted: G      D     E  X
[16196.304311] TPC: <do_signal+0x440/0x560>
[16196.304327] g0: 000000000040770c g1: 000000000000032f g2:
0000000000000000 g3: ffff80010007c000
[16196.304341] g4: ffff8003f13f9240 g5: ffff8003fdaa4000 g6:
ffff800087df8000 g7: 0000000000004000
[16196.304355] o0: 00000000000001ef o1: 000000000000032f o2:
ffff800087df8000 o3: 0000000000000007
[16196.304368] o4: 0000000000000007 o5: fffffffffffffff2 sp:
ffff800087dfb451 ret_pc: 000000000042d8c4
[16196.304390] RPC: <do_signal+0x424/0x560>
[16196.304404] l0: 0308000103000004 l1: 00000044f0001201 l2:
000000000040770c l3: 0000000000000000
[16196.304418] l4: 0000000000000000 l5: ffff80010007c000 l6:
ffff800087df8000 l7: 0000000011001002
[16196.304432] i0: 0000000000000077 i1: 000000000000020f i2:
fffffffffffffff2 i3: ffff800187dfff70
[16196.304445] i4: fffffffffffffff2 i5: 0000000000000007 i6:
ffff800087dfb4d1 i7: 000000000042d6fc
[16196.304472] I7: <do_signal+0x25c/0x560>
[16205.284863] aes_sparc64: sparc64 aes opcodes not available.
[16205.753417] Call Trace:
[16205.753453] [<000000000042d6fc>] do_signal+0x25c/0x560
[16205.753478] [<000000000042e218>] do_notify_resume+0x58/0xa0
[16205.753500] [<0000000000404b48>] __handle_signal+0xc/0x30
[16205.753525] Caller[000000000042d6fc]: do_signal+0x25c/0x560
[16205.753546] Caller[000000000042e218]: do_notify_resume+0x58/0xa0
[16205.753562] Caller[0000000000404b48]: __handle_signal+0xc/0x30
[16205.753575] Caller[000001000007294c]: 0x1000007294c
[16205.753580] Instruction DUMP:
[16205.753587]  c029a00d
[16205.753595]  b4168008
[16205.753602]  900761e8
[16205.753610] <d25e2070>
[16205.753616]  40014fef
[16205.753623]  b416801c
[16205.753629]  c2592468
[16205.753636]  b8100008
[16205.753644]  920126c8


then also these messages. I think they explain the "slowness" and
apparent freeze of the system - I was about to power-cycle but waited
and it recovered:

[16253.233924] ata1.00: qc timeout (cmd 0xa0)
[16335.213786] PM: hibernation: Basic memory bitmaps created
[16830.619976] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[16830.620193]  (detected by 18, t=5252 jiffies, g=711181, q=6)
[16830.620215] rcu: All QSes seen, last rcu_sched kthread activity 1191
(4299098242-4299097051), jiffies_till_next_fqs=1, root ->qsmask 0x0
[16830.620491] rcu: rcu_sched kthread starved for 1191 jiffies! g711181
f0x2 RCU_GP_CLEANUP(7) ->state=0x0 ->cpu=30
[16830.620749] rcu:     Unless rcu_sched kthread gets sufficient CPU
time, OOM is now expected behavior.
[16830.620844] rcu: RCU grace-period kthread stack dump:
[16830.621069] task:rcu_sched       state:R  running task     stack:
0 pid:   10 ppid:     2 flags:0x05000000
[16830.621095] Call Trace:
[16830.621128] [<0000000000bda560>] _cond_resched+0x40/0x60
[16830.621153] [<00000000004ee1d0>] rcu_gp_kthread+0x9b0/0xe40
[16830.621175] [<0000000000491c48>] kthread+0x108/0x120
[16830.621205] [<00000000004060c8>] ret_from_fork+0x1c/0x2c
[16830.621224] [<0000000000000000>] 0x0
[16982.524373] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[16982.524591]  (detected by 20, t=5252 jiffies, g=711637, q=15)
[16982.524612] rcu: All QSes seen, last rcu_sched kthread activity 5247
(4299136209-4299130962), jiffies_till_next_fqs=1, root ->qsmask 0x0
[16982.524839] rcu: rcu_sched kthread starved for 5247 jiffies! g711637
f0x2 RCU_GP_WAIT_FQS(5) ->state=0x0 ->cpu=16
[16982.525098] rcu:     Unless rcu_sched kthread gets sufficient CPU
time, OOM is now expected behavior.
[16982.525201] rcu: RCU grace-period kthread stack dump:
[16982.525377] task:rcu_sched       state:R  running task     stack:
0 pid:   10 ppid:     2 flags:0x06000000
[16982.525404] Call Trace:
[16982.525435] [<0000000000bda3d4>] schedule+0x54/0x100
[16982.525464] [<0000000000bddc50>] schedule_timeout+0x70/0x140
[16982.525489] [<00000000004edeb4>] rcu_gp_kthread+0x694/0xe40
[16982.525511] [<0000000000491c48>] kthread+0x108/0x120
[16982.525540] [<00000000004060c8>] ret_from_fork+0x1c/0x2c
[16982.525558] [<0000000000000000>] 0x0
[17596.494910] sched: RT throttling activated
[17664.665608] PM: hibernation: Basic memory bitmaps freed
[17664.838884] audit: type=1400 audit(1642442424.829:817):
apparmor="STATUS" info="failed to unpack policydb" error=-86
profile="unconfined" name="/usr/bin/pulseaudio-eg" pid=234012
comm="stress-ng" name="/usr/bin/pulseaudio-eg" offset=2536
[17665.077468] aes_sparc64: sparc64 aes opcodes not available.
[17665.685823] aes_sparc64: sparc64 aes opcodes not available.
[17686.297683] systemd[1]: systemd-journald.service: Main process
exited, code=killed, status=6/ABRT
[17686.300569] systemd[1]: systemd-journald.service: Failed with result
'watchdog'.
[17686.733029] systemd[1]: systemd-journald.service: Consumed 53.065s
CPU time.
[17686.938707] systemd[1]: systemd-journald.service: Scheduled restart
job, restart counter is at 3.
[17687.012114] systemd[1]: Stopped Journal Service.
[17687.020312] systemd[1]: systemd-journald.service: Consumed 53.065s
CPU time.
[17690.324815] systemd[1]: Starting Journal Service...
[17690.831298] systemd-journald[258852]: File
/var/log/journal/bdb2a41ce825489ba567bea53add247e/system.journal
corrupted or uncleanly shut down, renaming and replacing.
[17709.718653] systemd[1]: Started Journal Service.



Perhaps we can at least understand these error and restrict to specific
tests? This could gives us a better testing and also Frank could try to
run the same tests on his systems.

Riccardo

Reply via email to