So, here's a status report, sorry for the long mail (summary at the end):

I ran all tests on my Sun Fire V215 for now, because it is a bit faster than the Ultra 10 and its ALOM makes remote testing a little easier. It also has two CPUs, which helps uncover SMP-related issues.

Kernel 4.19 does indeed seem to run more stably than later Debian kernels. That at least gives me a stable system to fall back on when things break. Thanks for that hint.

My first attempt at bisecting ran into lots of compilation issues with the default config of each version and gcc 14. All the 4.x and 5.x kernels fail with at least the following errors (some versions have more):

arch/sparc/kernel/mdesc.c: In function 'mdesc_node_by_name':
arch/sparc/kernel/mdesc.c:646:22: error: 'strcmp' reading 1 or more bytes from a region of size 0 [-Werror=stringop-overread]
  646 | if (!strcmp(names + ep[ret].name_offset, name))
      | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
arch/sparc/kernel/mdesc.c:78:33: note: at offset [32, 8589934606] into source object 'mdesc' of size 16
   78 | struct mdesc_hdr mdesc;
      | ^~~~~
...
In function 'kernel_lds_init',
inlined from 'report_memory' at arch/sparc/mm/init_64.c:3112:2:
arch/sparc/mm/init_64.c:3102:31: error: array subscript -1 is outside array bounds of 'char[]' [-Werror=array-bounds=]
 3102 | data_resource.end = compute_kern_paddr(_edata - 1);
      | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
./include/asm-generic/sections.h: In function 'report_memory':
./include/asm-generic/sections.h:36:32: note: at offset -1 into object '_edata' of size [0, 9223372036854775807]
   36 | extern char _data[], _sdata[], _edata[];
      | ^~~~~~
...

I then tried gcc 8.1, which was released at roughly the same time as 4.19. Older kernels compile fine with this version, but 6.10 failed with the errors below.
I couldn't reproduce this error later on, so it may have been a fluke:

`.exit.text' referenced in section `__jump_table' of fs/fuse/inode.o: defined in discarded section `.exit.text' of fs/fuse/inode.o
`.exit.text' referenced in section `__jump_table' of fs/fuse/inode.o: defined in discarded section `.exit.text' of fs/fuse/inode.o

I decided to ignore the error for now and start bisecting from 4.19 to 6.10 with gcc 8.1.

Next issue: the default kernel config lacks some drivers that my system needs in order to boot. For my Fire V215, at least CONFIG_FUSIONMPT and CONFIG_CGROUPS are needed, plus a few other things. systemd requires cgroups v2 support these days. I started off with a default config in the first bisect step (corresponding to 5.14), added the required options, and then did a make oldconfig in each subsequent step, answering all questions with the default.

Building with make bindeb-pkg produces an almost usable kernel package. For some reason, grub-ieee1275 requires an unpacked kernel, so the installed vmlinuz needed to be gunzipped afterwards.

Now for the actual testing: triggering a panic/oops reliably was difficult. The Debian 6.10 kernel usually crashes relatively quickly on disk I/O, and enabling swap accelerates the effect. bonnie++ should therefore make for a good stress test.

I don't have the exact commit IDs of each bisection step, but it was (roughly) 5.14-rc6, 6.6-rc7, 6.8-rc3, 6.9, 6.10.
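
The per-step procedure described above (set the required options, take defaults for everything else, build the package, unpack the installed image) could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not necessarily the exact commands used: the helper names enable_opts and unpack_kernel are made up for this sketch, the kernel tree's own scripts/config tool can do the same job as enable_opts, and the make invocations appear only as comments. make olddefconfig is used here as the non-interactive equivalent of answering every make oldconfig question with the default.

```shell
#!/bin/sh
# Hypothetical helpers for one bisection step; the names are illustrative only.

# enable_opts FILE OPT...: force each option to =y in a .config-style file.
enable_opts() {
    cfg=$1
    shift
    for opt in "$@"; do
        # Remove any existing setting for this option, whether it is
        # "CONFIG_X=..." or "# CONFIG_X is not set".
        grep -v -e "^${opt}=" -e "^# ${opt} is not set\$" "$cfg" > "$cfg.tmp" || true
        mv "$cfg.tmp" "$cfg"
        echo "${opt}=y" >> "$cfg"
    done
}

# unpack_kernel FILE: if FILE is gzip-compressed, replace it with the
# decompressed image; otherwise leave it untouched.
unpack_kernel() {
    img=$1
    if gzip -t < "$img" 2>/dev/null; then
        gzip -dc < "$img" > "$img.tmp" && mv "$img.tmp" "$img"
    fi
}

# Intended use inside a kernel tree at each bisection step (not run here):
#   enable_opts .config CONFIG_FUSIONMPT CONFIG_CGROUPS
#   make olddefconfig      # accept the default for every new question
#   make bindeb-pkg
#   unpack_kernel /boot/vmlinuz-<version>   # after installing the package
```

Calling unpack_kernel twice on the same file is harmless, since the second call sees an image that is no longer gzip-compressed and leaves it alone.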

There were a few odd non-critical issues, such as this I/O error with 5.14 (but nothing in dmesg):

$ /usr/sbin/bonnie++
Writing a byte at a time...done
Writing intelligently...done
Rewriting...Can't write block.: Unknown error 2560
Bonnie: drastic I/O error (re write(2)): Unknown error 2560

6.2 produces this warning at boot:

[ +21.090317] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[ +1.422401] rcu: 0-...!: (1 GPs behind) idle=a29c/0/0x1 softirq=18/19 fqs=44
[ +0.093960] (detected by 0, t=2246 jiffies, g=-1175, q=989 ncpus=2)
[ +0.083646] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 6.2.0-rc7+ #18
[ +0.083641] TSTATE: 0000004411001605 TPC: 000000000042beac TNPC: 000000000042beb0 Y: 00000000 Not tainted
[ +0.129479] TPC: <arch_cpu_idle+0x8c/0xa0>
[ +0.053848] g0: 00000000004209d0 g1: 00000000015282c0 g2: 00000000015105c8 g3: 0000000000000001
[ +0.114585] g4: fff0000000390ba0 g5: fff000027e2f0000 g6: fff0000000398000 g7: 00000000173aa294
[ +0.114582] o0: fff0000000390ba0 o1: 0000000000000001 o2: 000000000130ae78 o3: 00000000015105c8
[ +0.114580] o4: 00000000015280c0 o5: 000000000130b580 sp: fff000000039b3d1 ret_pc: 000000000042bea0
[ +0.119164] RPC: <arch_cpu_idle+0x80/0xa0>
[ +0.053850] l0: 0000000001407f20 l1: 0000000000022c05 l2: 0000000000000000 l3: 000000000130b538
[ +0.114585] l4: 000000000130b400 l5: 0000000000000040 l6: 0000000000000000 l7: 0000000001408140
[ +0.114581] i0: 00000000173aa299 i1: fff000027f814990 i2: 0000000000000001 i3: 0000000000000001
[ +0.114580] i4: fff000027f814990 i5: 0000000001524990 i6: fff000000039b481 i7: 0000000000b22f68
[ +0.114582] I7: <default_idle_call+0x48/0x100>
[ +0.058433] Call Trace:
[ +0.032082] [<0000000000b22f68>] default_idle_call+0x48/0x100
[ +0.075624] [<00000000004adc28>] do_idle+0x108/0x180
[ +0.065311] [<00000000004adf34>] cpu_startup_entry+0x14/0x40
[ +0.074477] [<000000000043ede4>] smp_callin+0xe4/0x120
[ +0.067603] [<0000000001318614>] 0x1318614
[ +0.053853] [<0000000040000000>] 0x40000000

It also failed to shut down properly:

[ 1634.268777] systemd-journald[181]: Failed to send WATCHDOG=1 notification message: Connection refused
[ 1754.268963] systemd-journald[181]: Failed to send WATCHDOG=1 notification message: Transport endpoint is not connected

The shutdown got stuck after that. I did not see this with any other kernels.

From 6.2 onward, the tg3 network driver produces this warning at shutdown (but the shutdown proceeds from there without issue):

[ 1594.751376] ------------[ cut here ]------------
[ 1594.812280] WARNING: CPU: 0 PID: 3914 at kernel/irq/msi.c:196 msi_domain_free_descs+0xdc/0x100
[ 1594.925813] Modules linked in: binfmt_misc flash sg fuse autofs4 dm_mod mptsas sr_mod scsi_transport_sas mptscsih ehci_pci mptbase tg3 cdrom ehci_hcd libphy
[ 1595.110450] CPU: 0 PID: 3914 Comm: ip Not tainted 6.2.0-rc7+ #18
[ 1595.189586] Call Trace:
[ 1595.221667] [<0000000000465da8>] __warn+0xe8/0x120
[ 1595.284686] [<0000000000b11088>] warn_slowpath_fmt+0x30/0x70
[ 1595.359165] [<00000000004cdbfc>] msi_domain_free_descs+0xdc/0x100
[ 1595.439371] [<00000000004ce878>] msi_domain_free_msi_descs_range+0x18/0x40
[ 1595.529891] [<0000000000819984>] pci_free_msi_irqs+0x4/0x20
[ 1595.603222] [<0000000000817e94>] pci_disable_msi+0x54/0x80
[ 1595.675408] [<00000000100b0464>] tg3_ints_fini+0x64/0xe0 [tg3]
[ 1595.752282] [<00000000100c880c>] tg3_stop+0x22c/0x2c0 [tg3]
[ 1595.825614] [<00000000100c88c0>] tg3_close+0x20/0xa0 [tg3]
[ 1595.897799] [<000000000096c8e8>] __dev_close_many+0x88/0x100
[ 1595.972278] [<0000000000976c64>] __dev_change_flags+0xa4/0x1e0
[ 1596.049047] [<0000000000976db8>] dev_change_flags+0x18/0x60
[ 1596.122378] [<00000000009872a0>] do_setlink+0x2e0/0x1140
[ 1596.192273] [<000000000098d138>] __rtnl_newlink+0x3f8/0x7e0
[ 1596.265605] [<000000000098d550>] rtnl_newlink+0x30/0x60
[ 1596.334353] [<0000000000986a7c>] rtnetlink_rcv_msg+0x27c/0x360
[ 1596.411144] ---[ end trace 0000000000000000 ]---

On 6.6, I got this warning at boot:

[ +21.089612] rcu: INFO: rcu_sched self-detected stall on CPU
[ +0.000007] rcu: 1-....: (281 ticks this GP) idle=36cc/1/0x4000000000000002 softirq=28/28 fqs=1050
[ +0.000012] rcu: (t=2101 jiffies g=-1175 q=1029 ncpus=2)
[ +0.000007] CPU: 1 PID: 1 Comm: swapper/1 Not tainted 6.6.0-rc7+ #19
[ +0.000008] TSTATE: 0000004411001602 TPC: 00000000004c23f0 TNPC: 00000000004c23f4 Y: 00001f91 Not tainted
[ +0.000005] TPC: <console_flush_all+0x1d0/0x4a0>
[ +0.000018] g0: 00000000004c23f0 g1: 000000000154bca0 g2: 0000000000000000 g3: 00000000016e1400
[ +0.000004] g4: fff0001004510000 g5: fff000103d2b6000 g6: fff0001004658000 g7: 000000000000000e
[ +0.000004] o0: 00000000016e17f8 o1: 0000000000000000 o2: 0000000000000000 o3: 000000000000004d
[ +0.000004] o4: 00000000016e0bd8 o5: 0000000001753250 sp: fff000100465a9c1 ret_pc: 00000000004c23e4
[ +0.000004] RPC: <console_flush_all+0x1c4/0x4a0>
[ +0.000007] l0: 000000000133b078 l1: 0000000000000000 l2: 0000000000000000 l3: 0000000000000000
[ +0.000004] l4: 0000000001435400 l5: 0000000000000000 l6: 00000000016e0bd8 l7: 00000000014b0840
[ +0.000004] i0: 0000000000000000 i1: fff000100465b368 i2: fff000100465b367 i3: 00000000016e1400
[ +0.000004] i4: 00000000016e0bd8 i5: 00000000016e17f8 i6: fff000100465aab1 i7: 00000000004c2730
[ +0.000004] I7: <console_unlock+0x70/0xe0>
[ +0.000008] Call Trace:
[ +0.000003] [<00000000004c2730>] console_unlock+0x70/0xe0
[ +0.000007] [<00000000004c3c8c>] vprintk_emit+0x1cc/0x220
[ +0.000009] [<0000000000b32aa4>] _printk+0x24/0x34
[ +0.000014] [<00000000008851e8>] serial_core_register_port+0x468/0x6c0
[ +0.000007] [<0000000000888998>] su_probe+0x178/0x3c0
[ +0.000009] [<0000000000898fe8>] platform_probe+0x28/0x80
[ +0.000006] [<0000000000896bf8>] really_probe+0xb8/0x2e0
[ +0.000011] [<0000000000896f04>] driver_probe_device+0x24/0xe0
[ +0.000007] [<0000000000897104>] __driver_attach+0x64/0x120
[ +0.000007] [<0000000000894c10>] bus_for_each_dev+0x50/0xa0
[ +0.000007] [<0000000000895d3c>] bus_add_driver+0x17c/0x1e0
[ +0.000006] [<00000000008979d4>] driver_register+0x74/0x120
[ +0.000008] [<000000000151ab90>] sunsu_init+0x170/0x1d4
[ +0.000009] [<0000000000427bf4>] do_one_initcall+0x34/0x220
[ +0.000008] [<00000000014f8fb4>] kernel_init_freeable+0x210/0x274
[ +0.000012] [<0000000000b3c1bc>] kernel_init+0x18/0x13c

On 6.6, I also found these messages in the kernel log (but apparently without negative consequences):

[ +0.371437] Kernel unaligned access at TPC[4ea294] load_module+0xff4/0x1c60
[ +0.091825] Kernel unaligned access at TPC[4ea294] load_module+0xff4/0x1c60
[ +0.091734] Kernel unaligned access at TPC[4ea294] load_module+0xff4/0x1c60
[ +0.091763] Kernel unaligned access at TPC[4ea294] load_module+0xff4/0x1c60
[ +0.091757] Kernel unaligned access at TPC[4ea294] load_module+0xff4/0x1c60
[ +0.252176] log_unaligned: 4200 callbacks suppressed
[ +0.055120] Kernel unaligned access at TPC[4e75e0] cmp_name+0x0/0x20
[ +0.000023] Kernel unaligned access at TPC[4e75e0] cmp_name+0x0/0x20
[ +0.000009] Kernel unaligned access at TPC[4e75e0] cmp_name+0x0/0x20
[ +0.000009] Kernel unaligned access at TPC[4e75e0] cmp_name+0x0/0x20

Conclusion:

It looks very much like this isn't a kernel bug as such, but rather a problem with either the Debian kernel config or newer gcc versions. I will test some other gcc versions next.

Unfortunately, I couldn't test the config from the Debian linux-image-6.10.7-sparc64-smp package. Trying to build a kernel with this config produced a 700MB package, and the resulting initrd was too large to fit into my boot partition. Is there something special about how Debian builds kernel packages?
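
For the planned tests with other gcc versions, one way to keep the resulting packages apart is to tag each build with the compiler name. The sketch below only prints the make commands rather than running them. The function name build_cmd and the compiler names gcc-8, gcc-12 and gcc-14 are placeholders for whatever is actually installed, and using LOCALVERSION as the tag is an assumption about what is convenient, not something that has been tried here.

```shell
#!/bin/sh
# build_cmd CC JOBS: print the make invocation for one compiler.
# Hypothetical helper; compiler names are placeholders.
build_cmd() {
    cc=$1
    jobs=$2
    # LOCALVERSION appends the compiler name to the kernel release string,
    # so packages built with different compilers get different names.
    printf 'make CC=%s LOCALVERSION=-%s -j%s bindeb-pkg\n' "$cc" "$cc" "$jobs"
}

# Print one command per compiler to be tested (two jobs, matching two CPUs).
for cc in gcc-8 gcc-12 gcc-14; do
    build_cmd "$cc" 2
done
```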