Re: Linux kernel stability fixes for older SPARCs

Gregor Riepl Wed, 18 Sep 2024 09:17:39 -0700

So, here's a status report. Sorry for the long mail (summary at the end):


I did all tests on my Sun Fire V215 for now, because the machine is a bit 
faster than the Ultra 10 and the ALOM makes remote testing a little bit easier. 
It also has two CPUs, helping to uncover SMP-related issues.

Kernel 4.19 does indeed seem to run more stably than later Debian kernels. That 
at least gives me a stable system to fix things when they break. Thanks for that 
hint.

My first attempt at bisecting ran into lots of compilation issues with the 
default config of each version and gcc 14. All the 4.x and 5.x kernels fail 
with at least the following errors (some versions produce more):

arch/sparc/kernel/mdesc.c: In function 'mdesc_node_by_name':
arch/sparc/kernel/mdesc.c:646:22: error: 'strcmp' reading 1 or more bytes from a region of size 0 [-Werror=stringop-overread]
  646 |                 if (!strcmp(names + ep[ret].name_offset, name))
      |                      ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
arch/sparc/kernel/mdesc.c:78:33: note: at offset [32, 8589934606] into source object 'mdesc' of size 16
   78 |         struct mdesc_hdr        mdesc;
      |                                 ^~~~~
...
In function 'kernel_lds_init',
    inlined from 'report_memory' at arch/sparc/mm/init_64.c:3112:2:
arch/sparc/mm/init_64.c:3102:31: error: array subscript -1 is outside array bounds of 'char[]' [-Werror=array-bounds=]
 3102 |         data_resource.end   = compute_kern_paddr(_edata - 1);
      |                               ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
./include/asm-generic/sections.h: In function 'report_memory':
./include/asm-generic/sections.h:36:32: note: at offset -1 into object '_edata' of size [0, 9223372036854775807]
   36 | extern char _data[], _sdata[], _edata[];
      |                                ^~~~~~
...

I then tried gcc 8.1, which dates from roughly the same period as 4.19. Older 
kernels compile fine with this version, but 6.10 failed with the errors below. I 
couldn't reproduce them later on, so they may have been a fluke:

`.exit.text' referenced in section `__jump_table' of fs/fuse/inode.o: defined in discarded section `.exit.text' of fs/fuse/inode.o
`.exit.text' referenced in section `__jump_table' of fs/fuse/inode.o: defined in discarded section `.exit.text' of fs/fuse/inode.o

I decided to ignore the error for now and start bisecting from 4.19 to 6.10 
with gcc 8.1.

Next issue: The default kernel config lacks some drivers that are essential to 
make my system bootable. For my Fire V215, at least CONFIG_FUSIONMPT and 
CONFIG_CGROUPS are needed, plus a few other things. systemd requires cgroups v2 
support these days. I started off with a default config in the first bisect 
step (corresponding to 5.14), added the required options, and then ran make 
oldconfig in each subsequent step, answering all questions with the default.
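
For reference, the per-step config handling described above could be scripted 
roughly as follows. This is only a sketch: refresh_config and DRY_RUN are 
made-up names, option names are passed without the CONFIG_ prefix, and make 
olddefconfig is assumed to be an acceptable stand-in for running make oldconfig 
and accepting every default.

```shell
# Hypothetical helper, not part of the original test setup.
# Usage: refresh_config [OPTION...]   (run from the top of a kernel tree)
# Set DRY_RUN=1 to print the commands instead of running them.
refresh_config() {
    run=${DRY_RUN:+echo}
    # First bisect step only: no .config yet, so start from the default one.
    if [ ! -f .config ]; then
        $run make defconfig || return 1
    fi
    # Enable each requested option (names given without the CONFIG_ prefix).
    for opt in "$@"; do
        $run scripts/config --enable "$opt" || return 1
    done
    # Set all new or unresolved symbols to their defaults without prompting.
    $run make olddefconfig
}
```

It would be called from the top of the kernel tree, for example as 
"refresh_config FUSIONMPT CGROUPS" with the two options named above. The exact 
symbol names should be checked against the Kconfig files of the version being 
built.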

Building with make bindeb-pkg produces an almost usable kernel package. For 
some reason, grub-ieee1275 requires an unpacked kernel, so the installed 
vmlinuz needed to be gunzipped afterwards.
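
The unpacking step can be wrapped in a small helper so that it is safe to run 
more than once. This is a sketch under the assumption that the installed image 
is a plain gzip file; unpack_kernel is a made-up name and the path in the usage 
note is only an example.

```shell
# Hypothetical helper: replace a gzip-compressed kernel image with its
# unpacked contents, and do nothing if the file is not gzip data.
unpack_kernel() {
    img=$1
    # gzip -t only verifies the data; it fails for files that are not gzip.
    if gzip -t < "$img" 2>/dev/null; then
        # Unpack next to the original, then swap the result into place.
        gunzip -c < "$img" > "$img.unpacked" && mv "$img.unpacked" "$img"
    fi
}
```

Typical use would be something like "unpack_kernel /boot/vmlinuz-6.10.0" after 
installing the package, with the file name adjusted to the kernel actually 
installed.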

Now for the actual testing... Triggering a panic/oops reliably was difficult. 
The Debian 6.10 kernel usually crashes relatively quickly on disk I/O, and 
enabling swap accelerates the effect, so bonnie++ should make for a good 
stress test.

I don't have the exact commit IDs of each bisection step, but the steps were 
(roughly) 5.14-rc6, 6.6-rc7, 6.8-rc3, 6.9, 6.10.
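
Since the commit IDs were not kept this time, a future run could record them 
with git bisect log. The following is a sketch only: start_bisect, 
save_bisect_log and DRY_RUN are made-up names, and v4.19 and v6.10 are the 
mainline tags assumed to correspond to the range described above.

```shell
# Hypothetical helpers, to be run inside a kernel git tree.
# Usage: start_bisect GOOD BAD
# Set DRY_RUN=1 to print the git commands instead of running them.
start_bisect() {
    run=${DRY_RUN:+echo}
    $run git bisect start || return 1
    $run git bisect good "$1" || return 1
    $run git bisect bad "$2"
}

# Call after each good/bad verdict to save the bisection history so far,
# including the commit ID of every step, to a file (default: bisect.log).
save_bisect_log() {
    git bisect log > "${1:-bisect.log}"
}
```

For the range tested here this would be "start_bisect v4.19 v6.10", followed by 
"save_bisect_log" after each step.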

There were a few odd non-critical issues, such as this I/O error with 5.14 (but 
nothing in dmesg):

$ /usr/sbin/bonnie++
Writing a byte at a time...done
Writing intelligently...done
Rewriting...Can't write block.: Unknown error 2560
Bonnie: drastic I/O error (re write(2)): Unknown error 2560

6.2 produces this warning at boot:

[ +21.090317] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[  +1.422401] rcu:      0-...!: (1 GPs behind) idle=a29c/0/0x1 softirq=18/19 
fqs=44
[  +0.093960]   (detected by 0, t=2246 jiffies, g=-1175, q=989 ncpus=2)
[  +0.083646] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 6.2.0-rc7+ #18
[  +0.083641] TSTATE: 0000004411001605 TPC: 000000000042beac TNPC: 
000000000042beb0 Y: 00000000    Not tainted
[  +0.129479] TPC: <arch_cpu_idle+0x8c/0xa0>
[  +0.053848] g0: 00000000004209d0 g1: 00000000015282c0 g2: 00000000015105c8 
g3: 0000000000000001
[  +0.114585] g4: fff0000000390ba0 g5: fff000027e2f0000 g6: fff0000000398000 
g7: 00000000173aa294
[  +0.114582] o0: fff0000000390ba0 o1: 0000000000000001 o2: 000000000130ae78 
o3: 00000000015105c8
[  +0.114580] o4: 00000000015280c0 o5: 000000000130b580 sp: fff000000039b3d1 
ret_pc: 000000000042bea0
[  +0.119164] RPC: <arch_cpu_idle+0x80/0xa0>
[  +0.053850] l0: 0000000001407f20 l1: 0000000000022c05 l2: 0000000000000000 
l3: 000000000130b538
[  +0.114585] l4: 000000000130b400 l5: 0000000000000040 l6: 0000000000000000 
l7: 0000000001408140
[  +0.114581] i0: 00000000173aa299 i1: fff000027f814990 i2: 0000000000000001 
i3: 0000000000000001
[  +0.114580] i4: fff000027f814990 i5: 0000000001524990 i6: fff000000039b481 
i7: 0000000000b22f68
[  +0.114582] I7: <default_idle_call+0x48/0x100>
[  +0.058433] Call Trace:
[  +0.032082] [<0000000000b22f68>] default_idle_call+0x48/0x100
[  +0.075624] [<00000000004adc28>] do_idle+0x108/0x180
[  +0.065311] [<00000000004adf34>] cpu_startup_entry+0x14/0x40
[  +0.074477] [<000000000043ede4>] smp_callin+0xe4/0x120
[  +0.067603] [<0000000001318614>] 0x1318614
[  +0.053853] [<0000000040000000>] 0x40000000

It also failed to shut down properly:

[ 1634.268777] systemd-journald[181]: Failed to send WATCHDOG=1 notification message: Connection refused
[ 1754.268963] systemd-journald[181]: Failed to send WATCHDOG=1 notification message: Transport endpoint is not connected

The shutdown got stuck after that. I did not see this with any other kernels.

From 6.2 onward, the tg3 network driver produces this warning at shutdown (but 
the shutdown proceeds from there without issue):

[ 1594.751376] ------------[ cut here ]------------
[ 1594.812280] WARNING: CPU: 0 PID: 3914 at kernel/irq/msi.c:196 msi_domain_free_descs+0xdc/0x100
[ 1594.925813] Modules linked in: binfmt_misc flash sg fuse autofs4 dm_mod mptsas sr_mod scsi_transport_sas mptscsih ehci_pci mptbase tg3 cdrom ehci_hcd libphy
[ 1595.110450] CPU: 0 PID: 3914 Comm: ip Not tainted 6.2.0-rc7+ #18
[ 1595.189586] Call Trace:
[ 1595.221667] [<0000000000465da8>] __warn+0xe8/0x120
[ 1595.284686] [<0000000000b11088>] warn_slowpath_fmt+0x30/0x70
[ 1595.359165] [<00000000004cdbfc>] msi_domain_free_descs+0xdc/0x100
[ 1595.439371] [<00000000004ce878>] msi_domain_free_msi_descs_range+0x18/0x40
[ 1595.529891] [<0000000000819984>] pci_free_msi_irqs+0x4/0x20
[ 1595.603222] [<0000000000817e94>] pci_disable_msi+0x54/0x80
[ 1595.675408] [<00000000100b0464>] tg3_ints_fini+0x64/0xe0 [tg3]
[ 1595.752282] [<00000000100c880c>] tg3_stop+0x22c/0x2c0 [tg3]
[ 1595.825614] [<00000000100c88c0>] tg3_close+0x20/0xa0 [tg3]
[ 1595.897799] [<000000000096c8e8>] __dev_close_many+0x88/0x100
[ 1595.972278] [<0000000000976c64>] __dev_change_flags+0xa4/0x1e0
[ 1596.049047] [<0000000000976db8>] dev_change_flags+0x18/0x60
[ 1596.122378] [<00000000009872a0>] do_setlink+0x2e0/0x1140
[ 1596.192273] [<000000000098d138>] __rtnl_newlink+0x3f8/0x7e0
[ 1596.265605] [<000000000098d550>] rtnl_newlink+0x30/0x60
[ 1596.334353] [<0000000000986a7c>] rtnetlink_rcv_msg+0x27c/0x360
[ 1596.411144] ---[ end trace 0000000000000000 ]---

On 6.6, I got this warning at boot:

[ +21.089612] rcu: INFO: rcu_sched self-detected stall on CPU
[  +0.000007] rcu:      1-....: (281 ticks this GP) 
idle=36cc/1/0x4000000000000002 softirq=28/28 fqs=1050
[  +0.000012] rcu:      (t=2101 jiffies g=-1175 q=1029 ncpus=2)
[  +0.000007] CPU: 1 PID: 1 Comm: swapper/1 Not tainted 6.6.0-rc7+ #19
[  +0.000008] TSTATE: 0000004411001602 TPC: 00000000004c23f0 TNPC: 
00000000004c23f4 Y: 00001f91    Not tainted
[  +0.000005] TPC: <console_flush_all+0x1d0/0x4a0>
[  +0.000018] g0: 00000000004c23f0 g1: 000000000154bca0 g2: 0000000000000000 
g3: 00000000016e1400
[  +0.000004] g4: fff0001004510000 g5: fff000103d2b6000 g6: fff0001004658000 
g7: 000000000000000e
[  +0.000004] o0: 00000000016e17f8 o1: 0000000000000000 o2: 0000000000000000 
o3: 000000000000004d
[  +0.000004] o4: 00000000016e0bd8 o5: 0000000001753250 sp: fff000100465a9c1 
ret_pc: 00000000004c23e4
[  +0.000004] RPC: <console_flush_all+0x1c4/0x4a0>
[  +0.000007] l0: 000000000133b078 l1: 0000000000000000 l2: 0000000000000000 
l3: 0000000000000000
[  +0.000004] l4: 0000000001435400 l5: 0000000000000000 l6: 00000000016e0bd8 
l7: 00000000014b0840
[  +0.000004] i0: 0000000000000000 i1: fff000100465b368 i2: fff000100465b367 
i3: 00000000016e1400
[  +0.000004] i4: 00000000016e0bd8 i5: 00000000016e17f8 i6: fff000100465aab1 
i7: 00000000004c2730
[  +0.000004] I7: <console_unlock+0x70/0xe0>
[  +0.000008] Call Trace:
[  +0.000003] [<00000000004c2730>] console_unlock+0x70/0xe0
[  +0.000007] [<00000000004c3c8c>] vprintk_emit+0x1cc/0x220
[  +0.000009] [<0000000000b32aa4>] _printk+0x24/0x34
[  +0.000014] [<00000000008851e8>] serial_core_register_port+0x468/0x6c0
[  +0.000007] [<0000000000888998>] su_probe+0x178/0x3c0
[  +0.000009] [<0000000000898fe8>] platform_probe+0x28/0x80
[  +0.000006] [<0000000000896bf8>] really_probe+0xb8/0x2e0
[  +0.000011] [<0000000000896f04>] driver_probe_device+0x24/0xe0
[  +0.000007] [<0000000000897104>] __driver_attach+0x64/0x120
[  +0.000007] [<0000000000894c10>] bus_for_each_dev+0x50/0xa0
[  +0.000007] [<0000000000895d3c>] bus_add_driver+0x17c/0x1e0
[  +0.000006] [<00000000008979d4>] driver_register+0x74/0x120
[  +0.000008] [<000000000151ab90>] sunsu_init+0x170/0x1d4
[  +0.000009] [<0000000000427bf4>] do_one_initcall+0x34/0x220
[  +0.000008] [<00000000014f8fb4>] kernel_init_freeable+0x210/0x274
[  +0.000012] [<0000000000b3c1bc>] kernel_init+0x18/0x13c

On 6.6, I also found these messages in the kernel log (apparently without 
negative consequences):

[  +0.371437] Kernel unaligned access at TPC[4ea294] load_module+0xff4/0x1c60
[  +0.091825] Kernel unaligned access at TPC[4ea294] load_module+0xff4/0x1c60
[  +0.091734] Kernel unaligned access at TPC[4ea294] load_module+0xff4/0x1c60
[  +0.091763] Kernel unaligned access at TPC[4ea294] load_module+0xff4/0x1c60
[  +0.091757] Kernel unaligned access at TPC[4ea294] load_module+0xff4/0x1c60
[  +0.252176] log_unaligned: 4200 callbacks suppressed
[  +0.055120] Kernel unaligned access at TPC[4e75e0] cmp_name+0x0/0x20
[  +0.000023] Kernel unaligned access at TPC[4e75e0] cmp_name+0x0/0x20
[  +0.000009] Kernel unaligned access at TPC[4e75e0] cmp_name+0x0/0x20
[  +0.000009] Kernel unaligned access at TPC[4e75e0] cmp_name+0x0/0x20


Conclusion:

It looks very much like this isn't a kernel bug as such, but rather a problem 
with either the Debian kernel config or newer gcc versions.

I will test some other gcc versions next.

Unfortunately, I couldn't test the config from the Debian 
linux-image-6.10.7-sparc64-smp package. Trying to build a kernel with this 
config produced a 700MB package, and the resulting initrd was too large to fit 
into my boot partition. Is there something special about how Debian builds 
kernel packages?
