Re: 5.10.0-4-sparc64-smp #1 Debian 5.10.19-1 crashes on T2000

2022-01-17 Thread Riccardo Mottola
Hi Adrian,

John Paul Adrian Glaubitz wrote:
> Did you forget to create an initrd? After installing the kernel, run:
> 
> $ update-initramfs -k KERNEL_VERSION -c

I did not run it this way, will do.

I did have one, however, and a very large one:
316M Jan 14 17:15 initrd.img-5.9.0-rc1+

which filled up my /boot.

I removed it and regenerated it with your command, but now I get dropped
into the initramfs shell with no modules found. Hmm...
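
One thing worth checking when the initramfs comes up without modules: update-initramfs collects them from /lib/modules/<version>, so the version string passed to -k has to match that directory name exactly (for a self-built kernel that includes a suffix such as the trailing "+"). A minimal sketch of such a check follows; the function name and the optional root-directory argument are made up for illustration and are not part of initramfs-tools.

```shell
# check_modules VERSION [ROOT]
# Succeeds if ROOT/lib/modules/VERSION exists and contains modules.dep.
# ROOT defaults to "" (the running system); it only exists so the check
# can be tried against a scratch directory. Hypothetical helper.
check_modules() {
    version=$1
    root=${2:-}
    dir="$root/lib/modules/$version"
    if [ ! -d "$dir" ]; then
        echo "missing: $dir"
        return 1
    fi
    if [ ! -f "$dir/modules.dep" ]; then
        echo "no modules.dep in $dir (run depmod $version?)"
        return 1
    fi
    echo "ok: $dir"
}
```

For example, `check_modules 5.9.0-rc1+` before running update-initramfs for that version. If the directory is there but very large, unstripped modules might explain a 316M image; installing them with `make INSTALL_MOD_STRIP=1 modules_install`, or setting MODULES=dep in /etc/initramfs-tools/initramfs.conf, may shrink it, though that is a guess about this particular build.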

> 
>> The good news is that latest kernel installed seems to boot and takes
>> all CPUs online. How stable it is I don't know, it needs to be tested.
> Please run some stress tests such as stress-ng and report back.

Not nice. I started compiling some stuff and the box froze. I connected the
serial console and could not resume due to a "Fast Data Access MMU miss".

I will now stress things again, this time keeping the serial console
attached to another computer, and see.

Up to last week, with the old 5.9 kernel, I had no issues compiling even
large things such as the Gecko-based ArcticFox or the Linux kernel itself. So
unless the Fire failed over the weekend, it smells like kernel instability.

What should I use in stress-ng? I just tried "--all 8 --timeout 120s",
and the machine locked up after a little while. On the serial console I see:

[ 8563.833509] current->{active_,}mm->context = 0fcb
[ 8563.833523] current->{active_,}mm->pgd = 8000d35c8000
[ 8563.846347] Unable to handle kernel NULL pointer dereference in mna handler
[ 8563.846365]  at virtual address 00e7
[ 8563.846380] current->{active_,}mm->context = 0fcc
[ 8563.846395] current->{active_,}mm->pgd = 8000d2d3c000
[ 8563.856171] Unable to handle kernel NULL pointer dereference
[ 8563.863274] tsk->{mm,active_mm}->context = 0fd2
[ 8563.863294] tsk->{mm,active_mm}->pgd = 8000d3fc
[ 8563.928911] Unable to handle kernel NULL pointer dereference in mna handler
[ 8563.928935]  at virtual address 00e7
[ 8563.928955] current->{active_,}mm->context = 0fde
[ 8563.928972] current->{active_,}mm->pgd = 8000d32e8000
[ 8563.952221] Unable to handle kernel NULL pointer dereference in mna handler
[ 8563.952244]  at virtual address 00e7
[ 8563.952261] current->{active_,}mm->context = 0fe3
[ 8563.952278] current->{active_,}mm->pgd = 8000d2f54000
[ 8563.954004] Unable to handle kernel NULL pointer dereference in mna handler
[ 8563.954022]  at virtual address 00e7
[ 8563.954037] current->{active_,}mm->context = 0fe5
[ 8563.954053] current->{active_,}mm->pgd = 8000d2d5c000
[ 8563.972643] Unable to handle kernel NULL pointer dereference
[ 8563.972660] tsk->{mm,active_mm}->context = 0fea
[ 8563.972677] tsk->{mm,active_mm}->pgd = 8000d31300

These are kernel messages, not Open Firmware (OF) ones, so it looks like a
kernel problem.

Riccardo



Re: 5.10.0-4-sparc64-smp #1 Debian 5.10.19-1 crashes on T2000

2022-01-17 Thread John Paul Adrian Glaubitz
Hi!

On 1/17/22 14:41, Riccardo Mottola wrote:
>>> The good news is that latest kernel installed seems to boot and takes
>>> all CPUs online. How stable it is I don't know, it needs to be tested.
>>
>> Please run some stress tests such as stress-ng and report back.
> 
> Not nice. I started compiling some stuff and the box froze, I connected
> serial console and could not resume due to "Fast Data Access MMU miss"

So, this crash occurs with the latest 5.15 kernel on your T2000?

In my experience, the most stable kernels on the older SPARCs are still the
4.19 kernels. Thus, we should start bisecting to find out what commit actually
made the kernel unreliable on these older SPARCs.
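
A rough outline of such a bisection, written as comments plus one small helper. The tags v4.19 and v5.15 are assumptions taken from the versions mentioned in this thread, and `dmesg_verdict` is a hypothetical helper, not an existing tool: it reads a console or dmesg capture on standard input and returns 1 ("bad") if it contains one of the NULL-pointer or Oops messages reported earlier in the thread, 0 ("good") otherwise.

```shell
# Outline, to be run inside a clone of the upstream kernel tree:
#   git bisect start
#   git bisect bad v5.15     # assumed: a version known to lock up
#   git bisect good v4.19    # assumed: a version considered stable
#   ... build, install and boot the commit git checks out, run the stress test ...
#   git bisect good          # or: git bisect bad, depending on the result
#   git bisect reset         # when the first bad commit has been found

# dmesg_verdict: read a kernel log on stdin; return 1 if it shows one of the
# crash signatures seen in this thread, 0 otherwise. Hypothetical helper.
dmesg_verdict() {
    if grep -E -q 'Unable to handle kernel NULL pointer dereference|Oops \[#[0-9]+\]'; then
        return 1
    fi
    return 0
}
```

After a stress run, `dmesg | dmesg_verdict` gives an exit status that can decide between `git bisect good` and `git bisect bad`. A machine that locks up hard never gets as far as running the check, so that case still has to be marked bad by hand.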

Adrian

-- 
 .''`.  John Paul Adrian Glaubitz
: :' :  Debian Developer - glaub...@debian.org
`. `'   Freie Universitaet Berlin - glaub...@physik.fu-berlin.de
  `-GPG: 62FF 8A75 84E0 2956 9546  0006 7426 3B37 F5B5 F913



Re: 5.10.0-4-sparc64-smp #1 Debian 5.10.19-1 crashes on T2000

2022-01-17 Thread Riccardo Mottola
Replying to myself.

I ran the old 5.9 kernel from Debian, which has proven quite stable.
I ran the same tests and did indeed find one error in the console:


[  380.918996] Unable to handle kernel NULL pointer dereference
[  380.919198] tsk->{mm,active_mm}->context = 057d
[  380.919326] tsk->{mm,active_mm}->pgd = 8003f1fd4000
[  380.919496]   \|/  \|/
 "@'/ .. \`@"
 /_| \__/ |_\
\__U_/
[  380.919510] stress-ng(1529): Oops [#287]
[  380.919536] CPU: 24 PID: 1529 Comm: stress-ng Tainted: G  D E
 X  5.9.0-5-sparc64-smp #1 Debian 5.9.15-1
[  380.919557] TSTATE: 008811001602 TPC: 0042d8e0 TNPC:
0042d8e4 Y: Tainted: G  D E  X
[  380.919587] TPC: 
[  380.919604] g0: 800100ef7194 g1: 0328 g2:
 g3: 80010002c000
[  380.919620] g4: 8003cf6f6b40 g5: 8003fdea4000 g6:
8003cf9cc000 g7: 4000
[  380.919634] o0: 01e8 o1: 0328 o2:
8003cf9cc000 o3: 0007
[  380.919650] o4: 0007 o5: fff2 sp:
8003cf9cf451 ret_pc: 0042d8c4
[  380.919673] RPC: 
[  380.919690] l0: 020800010404 l1: 0044f226 l2:
800100ef7194 l3: 
[  380.919705] l4:  l5: 0005 l6:
8003cf9cc000 l7: 00698c20
[  380.919719] i0: 0070 i1: 0208 i2:
fff2 i3: 8003cf9eff70
[  380.919732] i4: fff2 i5:  i6:
8003cf9cf4d1 i7: 0042d6fc
[  380.919752] I7: 
[  380.919760] Call Trace:
[  380.919783] [<0042d6fc>] do_signal+0x25c/0x560
[  380.919806] [<0042e218>] do_notify_resume+0x58/0xa0
[  380.919828] [<00404b48>] __handle_signal+0xc/0x30
[  380.919852] Caller[0042d6fc]: do_signal+0x25c/0x560
[  380.919874] Caller[0042e218]: do_notify_resume+0x58/0xa0
[  380.919893] Caller[00404b48]: __handle_signal+0xc/0x30
[  380.919910] Caller[800100ef716c]: 0x800100ef716c
[  380.919916] Instruction DUMP:
[  380.919923]  c029a00d
[  380.919930]  b4168008
[  380.919938]  900761e8
[  380.919945] 
[  380.919952]  40014fef
[  380.919959]  b416801c
[  380.919965]  c2592468
[  380.919972]  b818
[  380.919979]  920126c8

[  380.972358] systemd-journald[66048]: File
/var/log/journal/bdb2a41ce825489ba567bea53add247e/system.journal
corrupted or uncleanly shut down, renaming and replacing.
[  407.494981] systemd[1]: Started Journal Service.


as well as errors in the stressors:
stress-ng: info:  [12989] stress-ng-fanotify: 148 open, 41 close write,
128 close nowrite, 96 access, 27 modify
stress-ng: info:  [12908] stress-ng-fanotify: 159 open, 66 close write,
108 close nowrite, 88 access, 43 modify
stress-ng: info:  [12911] stress-ng-fanotify: 147 open, 43 close write,
122 close nowrite, 99 access, 20 modify
stress-ng: info:  [13079] stress-ng-fanotify: 159 open, 60 close write,
112 close nowrite, 97 access, 32 modify
stress-ng: info:  [12820] stress-ng-fanotify: 155 open, 46 close write,
123 close nowrite, 87 access, 27 modify
stress-ng: info:  [913] unsuccessful run completed in 282.58s (4 mins,
42.58 secs)
stress-ng: fail:  [913] chattr instance 2 corrupted bogo-ops counter, 48
vs 0
stress-ng: fail:  [913] chattr instance 2 hash error in bogo-ops counter
and run flag, 1918819509 vs 0
stress-ng: fail:  [913] chattr instance 6 corrupted bogo-ops counter, 50
vs 0
stress-ng: fail:  [913] chattr instance 6 hash error in bogo-ops counter
and run flag, 506138270 vs 0
stress-ng: fail:  [913] dnotify instance 4 corrupted bogo-ops counter,
224 vs 0
info: 5 failures reached, aborting stress process
stress-ng: fail:  [913] dnotify instance 4 hash error in bogo-ops
counter and run flag, 1503783545 vs 0
stress-ng: fail:  [913] dnotify instance 6 corrupted bogo-ops counter,
222 vs 0
stress-ng: fail:  [913] dnotify instance 6 hash error in bogo-ops
counter and run flag, 4199465241 vs 0
stress-ng: fail:  [913] metrics-check: stressor metrics corrupted, data
is compromised


However, the machine did not crash.
I ran exactly the same stress command again and the failures are
reproducible, so I suppose the tests may not be 64-bit big-endian safe,
or something similar.
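
To check that guess, the failing stressors could be rerun on their own. A small sketch, assuming the stress-ng output was saved unwrapped (one message per line, unlike the mail-wrapped paste above); `failed_stressors` is a hypothetical helper name.

```shell
# failed_stressors: read stress-ng output on stdin and print the name of each
# stressor that has at least one per-instance "fail:" line, sorted and unique.
failed_stressors() {
    awk '$1 == "stress-ng:" && $2 == "fail:" && $5 == "instance" { print $4 }' | sort -u
}
```

The names it prints can then be run in isolation, for instance `stress-ng --chattr 8 --dnotify 8 --timeout 120s --verify`, to see whether the counter corruption shows up without the other stressors running.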



Re: 5.10.0-4-sparc64-smp #1 Debian 5.10.19-1 crashes on T2000

2022-01-17 Thread Riccardo Mottola
Hi,


Riccardo Mottola wrote:
> John Paul Adrian Glaubitz wrote:
>>> Not nice. I started compiling some stuff and the box froze, I connected
>>> serial console and could not resume due to "Fast Data Access MMU miss"
>> So, this crash occurs with the latest 5.15 kernel on your T2000?
> Exactly, the latest kernel.
> 
> I will retest it with stress-ng as soon as I finish this email and copy
> the dmesg errors.
> 


Wow, by running the test suite once or twice I am able to make the system
power-cycle.

Frank, please test the latest kernel on yours :)

Riccardo



Re: 5.10.0-4-sparc64-smp #1 Debian 5.10.19-1 crashes on T2000

2022-01-17 Thread Riccardo Mottola
Hi,

John Paul Adrian Glaubitz wrote:
>> Not nice. I started compiling some stuff and the box froze, I connected
>> serial console and could not resume due to "Fast Data Access MMU miss"
> So, this crash occurs with the latest 5.15 kernel on your T2000?

Exactly, the latest kernel.

I will retest it with stress-ng as soon as I finish this email and copy
the dmesg errors.

> In my experience, the most stable kernels on the older SPARCs are still the
> 4.19 kernels. Thus, we should start bisecting to find out what commit actually
> made the kernel unreliable on these older SPARCs.


We must find a good way to test, though. I stress-tested the 5.9 kernel
further. The system sometimes seemed unresponsive, but eventually
recovered (some of the errors are pasted below for more detail). Thus I
would consider it "stable". I ran several short bursts of tests and then a
session with a 30-minute timeout which, due to hiccups, lasted more like
2 hours; afterwards, the machine was still up.

 sudo stress-ng --all 10 --timeout 30m

10 instances is more than the number of physical cores, but fewer than the
logical CPUs (32). The system has 16 GB of RAM and I see some OOMs in dmesg;
I wonder whether this is due to certain stress tests deliberately pushing
against memory limits.
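
One way to find out whether the OOMs come from the memory-hungry stressors rather than from the kernel would be to run the stressors one class at a time instead of all at once. A sketch that only prints the commands; the class list, instance count and timeout are illustrative choices, and `stress_by_class` is a made-up helper name.

```shell
# stress_by_class [INSTANCES] [TIMEOUT]
# Print one stress-ng command line per stressor class.
# Dry run only: nothing is started. Hypothetical helper.
stress_by_class() {
    instances=${1:-10}
    timeout=${2:-5m}
    for class in cpu memory vm filesystem os; do
        echo "stress-ng --sequential $instances --class $class --timeout $timeout --metrics-brief"
    done
}
```

Piping the output into `sh` would run the classes one after another. With --sequential the timeout applies to each stressor in turn, so a class with many stressors takes a while.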

[16195.300448] Unable to handle kernel NULL pointer dereference in mna
handler
[16195.741725]  40014fef
[16195.741793]  at virtual address 00e7
[16195.767936]  b416801c
[16195.767945]  c2592468
[16195.767990] current->{active_,}mm->context = 0bb8
[16195.768848]  b818
[16195.768857]  920126c8
[16195.769673] current->{active_,}mm->pgd = 800089cac000

[16195.770413]   \|/  \|/
 "@'/ .. \`@"
 /_| \__/ |_\
\__U_/
[16196.30] systemd-journald[219777]: /dev/kmsg buffer overrun, some
messages lost.
[16196.304235] stress-ng(234874): Oops [#864]
[16196.304262] CPU: 8 PID: 234874 Comm: stress-ng Tainted: G  D
E  X  5.9.0-5-sparc64-smp #1 Debian 5.9.15-1
[16196.304281] TSTATE: 008811001605 TPC: 0042d8e0 TNPC:
0042d8e4 Y: Tainted: G  D E  X
[16196.304311] TPC: 
[16196.304327] g0: 0040770c g1: 032f g2:
 g3: 80010007c000
[16196.304341] g4: 8003f13f9240 g5: 8003fdaa4000 g6:
800087df8000 g7: 4000
[16196.304355] o0: 01ef o1: 032f o2:
800087df8000 o3: 0007
[16196.304368] o4: 0007 o5: fff2 sp:
800087dfb451 ret_pc: 0042d8c4
[16196.304390] RPC: 
[16196.304404] l0: 030800010304 l1: 0044f0001201 l2:
0040770c l3: 
[16196.304418] l4:  l5: 80010007c000 l6:
800087df8000 l7: 11001002
[16196.304432] i0: 0077 i1: 020f i2:
fff2 i3: 800187dfff70
[16196.304445] i4: fff2 i5: 0007 i6:
800087dfb4d1 i7: 0042d6fc
[16196.304472] I7: 
[16205.284863] aes_sparc64: sparc64 aes opcodes not available.
[16205.753417] Call Trace:
[16205.753453] [<0042d6fc>] do_signal+0x25c/0x560
[16205.753478] [<0042e218>] do_notify_resume+0x58/0xa0
[16205.753500] [<00404b48>] __handle_signal+0xc/0x30
[16205.753525] Caller[0042d6fc]: do_signal+0x25c/0x560
[16205.753546] Caller[0042e218]: do_notify_resume+0x58/0xa0
[16205.753562] Caller[00404b48]: __handle_signal+0xc/0x30
[16205.753575] Caller[0107294c]: 0x107294c
[16205.753580] Instruction DUMP:
[16205.753587]  c029a00d
[16205.753595]  b4168008
[16205.753602]  900761e8
[16205.753610] 
[16205.753616]  40014fef
[16205.753623]  b416801c
[16205.753629]  c2592468
[16205.753636]  b818
[16205.753644]  920126c8


Then there are also these messages. I think they explain the "slowness" and
apparent freeze of the system. I was about to power-cycle, but I waited
and it recovered:

[16253.233924] ata1.00: qc timeout (cmd 0xa0)
[16335.213786] PM: hibernation: Basic memory bitmaps created
[16830.619976] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[16830.620193]  (detected by 18, t=5252 jiffies, g=711181, q=6)
[16830.620215] rcu: All QSes seen, last rcu_sched kthread activity 1191
(4299098242-4299097051), jiffies_till_next_fqs=1, root ->qsmask 0x0
[16830.620491] rcu: rcu_sched kthread starved for 1191 jiffies! g711181
f0x2 RCU_GP_CLEANUP(7) ->state=0x0 ->cpu=30
[16830.620749] rcu: Unless rcu_sched kthread gets sufficient CPU
time, OOM is now expected behavior.
[16830.620844] rcu: RCU grace-period kthread stack dump:
[16830.621069] task:rcu_sched   state:R  running task stack:
0 pid:   10 ppid: 2 flags:0x0500
[16830.621095] Call Trace:
[16830.621128] [<00bda560>] _cond_resched+0x40/0x60
[16830.621153] [<004ee1d0>] rcu_gp_kthread+0x9b0/0xe40
[16830.621175] [<00491c48>] kthread+0x108/0x120
[16830.621205] [<004060c8>] ret_from_fork+0x1c/0x2c
[16830.621224]