Hi,

John Paul Adrian Glaubitz wrote:
>> Not nice. I started compiling some stuff and the box froze, I connected
>> serial console and could not resume due to "Fast Data Access MMU miss"
> So, this crash occurs with the latest 5.15 kernel on your T2000?
Exactly, the latest kernel. I will retest it with stress-ng as soon as I finish this email and copy the dmesg errors.

> In my experience, the most stable kernels on the older SPARCs are still the
> 4.19 kernels. Thus, we should start bisecting to find out what commit actually
> made the kernel unreliable on these older SPARCs.

We must find a good way to test, though. I stress-tested the 5.9 kernel further. The system sometimes seemed unresponsive, but it eventually recovered (some of the errors are pasted below), so I would consider it "stable". I ran several small bursts of tests and then a session set for 30 minutes which, because of the hiccups, lasted more like 2 hours; afterwards the machine was still up.

sudo stress-ng --all 10 --timeout 30m

10 instances means more than the physical CPUs, but fewer than the 32 logical cores. The system has 16 GB of RAM and I see some OOMs in dmesg; I wonder whether those come from specific stressors deliberately pushing against the memory limits.

[16195.300448] Unable to handle kernel NULL pointer dereference in mna handler
[16195.741725] 40014fef
[16195.741793] at virtual address 00000000000000e7
[16195.767936] b416801c
[16195.767945] c2592468
[16195.767990] current->{active_,}mm->context = 0000000000000bb8
[16195.768848] b8100008
[16195.768857] 920126c8
[16195.769673] current->{active_,}mm->pgd = ffff800089cac000
[16195.770413] \|/ ____ \|/ "@'/ .. \`@" /_| \__/ |_\ \__U_/
[16196.303333] systemd-journald[219777]: /dev/kmsg buffer overrun, some messages lost.
[16196.304235] stress-ng(234874): Oops [#864]
[16196.304262] CPU: 8 PID: 234874 Comm: stress-ng Tainted: G D E X 5.9.0-5-sparc64-smp #1 Debian 5.9.15-1
[16196.304281] TSTATE: 0000008811001605 TPC: 000000000042d8e0 TNPC: 000000000042d8e4 Y: 00000000 Tainted: G D E X
[16196.304311] TPC: <do_signal+0x440/0x560>
[16196.304327] g0: 000000000040770c g1: 000000000000032f g2: 0000000000000000 g3: ffff80010007c000
[16196.304341] g4: ffff8003f13f9240 g5: ffff8003fdaa4000 g6: ffff800087df8000 g7: 0000000000004000
[16196.304355] o0: 00000000000001ef o1: 000000000000032f o2: ffff800087df8000 o3: 0000000000000007
[16196.304368] o4: 0000000000000007 o5: fffffffffffffff2 sp: ffff800087dfb451 ret_pc: 000000000042d8c4
[16196.304390] RPC: <do_signal+0x424/0x560>
[16196.304404] l0: 0308000103000004 l1: 00000044f0001201 l2: 000000000040770c l3: 0000000000000000
[16196.304418] l4: 0000000000000000 l5: ffff80010007c000 l6: ffff800087df8000 l7: 0000000011001002
[16196.304432] i0: 0000000000000077 i1: 000000000000020f i2: fffffffffffffff2 i3: ffff800187dfff70
[16196.304445] i4: fffffffffffffff2 i5: 0000000000000007 i6: ffff800087dfb4d1 i7: 000000000042d6fc
[16196.304472] I7: <do_signal+0x25c/0x560>
[16205.284863] aes_sparc64: sparc64 aes opcodes not available.
[16205.753417] Call Trace:
[16205.753453] [<000000000042d6fc>] do_signal+0x25c/0x560
[16205.753478] [<000000000042e218>] do_notify_resume+0x58/0xa0
[16205.753500] [<0000000000404b48>] __handle_signal+0xc/0x30
[16205.753525] Caller[000000000042d6fc]: do_signal+0x25c/0x560
[16205.753546] Caller[000000000042e218]: do_notify_resume+0x58/0xa0
[16205.753562] Caller[0000000000404b48]: __handle_signal+0xc/0x30
[16205.753575] Caller[000001000007294c]: 0x1000007294c
[16205.753580] Instruction DUMP:
[16205.753587] c029a00d
[16205.753595] b4168008
[16205.753602] 900761e8
[16205.753610] <d25e2070>
[16205.753616] 40014fef
[16205.753623] b416801c
[16205.753629] c2592468
[16205.753636] b8100008
[16205.753644] 920126c8

Then there are also these messages.
I think they explain the "slowness" and the apparent freeze of the system - I was about to power-cycle, but I waited and it recovered:

[16253.233924] ata1.00: qc timeout (cmd 0xa0)
[16335.213786] PM: hibernation: Basic memory bitmaps created
[16830.619976] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[16830.620193] (detected by 18, t=5252 jiffies, g=711181, q=6)
[16830.620215] rcu: All QSes seen, last rcu_sched kthread activity 1191 (4299098242-4299097051), jiffies_till_next_fqs=1, root ->qsmask 0x0
[16830.620491] rcu: rcu_sched kthread starved for 1191 jiffies! g711181 f0x2 RCU_GP_CLEANUP(7) ->state=0x0 ->cpu=30
[16830.620749] rcu: Unless rcu_sched kthread gets sufficient CPU time, OOM is now expected behavior.
[16830.620844] rcu: RCU grace-period kthread stack dump:
[16830.621069] task:rcu_sched state:R running task stack: 0 pid: 10 ppid: 2 flags:0x05000000
[16830.621095] Call Trace:
[16830.621128] [<0000000000bda560>] _cond_resched+0x40/0x60
[16830.621153] [<00000000004ee1d0>] rcu_gp_kthread+0x9b0/0xe40
[16830.621175] [<0000000000491c48>] kthread+0x108/0x120
[16830.621205] [<00000000004060c8>] ret_from_fork+0x1c/0x2c
[16830.621224] [<0000000000000000>] 0x0
[16982.524373] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[16982.524591] (detected by 20, t=5252 jiffies, g=711637, q=15)
[16982.524612] rcu: All QSes seen, last rcu_sched kthread activity 5247 (4299136209-4299130962), jiffies_till_next_fqs=1, root ->qsmask 0x0
[16982.524839] rcu: rcu_sched kthread starved for 5247 jiffies! g711637 f0x2 RCU_GP_WAIT_FQS(5) ->state=0x0 ->cpu=16
[16982.525098] rcu: Unless rcu_sched kthread gets sufficient CPU time, OOM is now expected behavior.
[16982.525201] rcu: RCU grace-period kthread stack dump:
[16982.525377] task:rcu_sched state:R running task stack: 0 pid: 10 ppid: 2 flags:0x06000000
[16982.525404] Call Trace:
[16982.525435] [<0000000000bda3d4>] schedule+0x54/0x100
[16982.525464] [<0000000000bddc50>] schedule_timeout+0x70/0x140
[16982.525489] [<00000000004edeb4>] rcu_gp_kthread+0x694/0xe40
[16982.525511] [<0000000000491c48>] kthread+0x108/0x120
[16982.525540] [<00000000004060c8>] ret_from_fork+0x1c/0x2c
[16982.525558] [<0000000000000000>] 0x0
[17596.494910] sched: RT throttling activated
[17664.665608] PM: hibernation: Basic memory bitmaps freed
[17664.838884] audit: type=1400 audit(1642442424.829:817): apparmor="STATUS" info="failed to unpack policydb" error=-86 profile="unconfined" name="/usr/bin/pulseaudio-eg" pid=234012 comm="stress-ng" name="/usr/bin/pulseaudio-eg" offset=2536
[17665.077468] aes_sparc64: sparc64 aes opcodes not available.
[17665.685823] aes_sparc64: sparc64 aes opcodes not available.
[17686.297683] systemd[1]: systemd-journald.service: Main process exited, code=killed, status=6/ABRT
[17686.300569] systemd[1]: systemd-journald.service: Failed with result 'watchdog'.
[17686.733029] systemd[1]: systemd-journald.service: Consumed 53.065s CPU time.
[17686.938707] systemd[1]: systemd-journald.service: Scheduled restart job, restart counter is at 3.
[17687.012114] systemd[1]: Stopped Journal Service.
[17687.020312] systemd[1]: systemd-journald.service: Consumed 53.065s CPU time.
[17690.324815] systemd[1]: Starting Journal Service...
[17690.831298] systemd-journald[258852]: File /var/log/journal/bdb2a41ce825489ba567bea53add247e/system.journal corrupted or uncleanly shut down, renaming and replacing.
[17709.718653] systemd[1]: Started Journal Service.

Perhaps we can at least understand these errors and restrict the runs to specific tests?
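One idea would be to drop --all and run one stressor class at a time, so we can see which class triggers the RCU stalls and which one triggers the do_signal oops. A possible starting point could look like the following; the classes and instance counts are only my guesses for the T2000, nothing I have verified yet:

# scheduler/signal-heavy stressors first, since the oops above is in do_signal
sudo stress-ng --class scheduler --sequential 8 --timeout 5m --metrics-brief

# memory stressors, kept to roughly 80% of the 16 GB in total (if I read the
# man page correctly), so we can tell real crashes apart from expected OOM kills
sudo stress-ng --vm 8 --vm-bytes 80% --timeout 10m --verify

# plain CPU load as a baseline that should stay harmless
sudo stress-ng --cpu 32 --timeout 10m --metrics-brief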
This would give us better testing, and Frank could also try to run the same tests on his systems.
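For the bisection Adrian suggested, once one of these runs reproduces the crash reliably, my understanding is that it would boil down to something like this (assuming we bisect the mainline tree between a known-good v4.19 and the failing v5.15; both endpoints would need to be confirmed on the T2000 first):

git bisect start
git bisect bad v5.15     # kernel that crashes on the T2000
git bisect good v4.19    # kernel believed to be stable here
# then, for each revision git checks out: build it, boot it on the T2000,
# run the stress-ng reproducer, and report the outcome with
git bisect good          # or "git bisect bad" if it crashed

Riccardo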