I am seeing seemingly random crashes while running the powerpc selftests on a P10 LPAR with a kernel built from the powerpc/merge branch. The mitigation-patching.sh test was running in both instances.

In the latest instance it looks like possible stack corruption:

[ 711.005150] count-cache-flush: hardware flush enabled.
[ 711.005153] link-stack-flush: software flush enabled.
[ 711.015306] barrier-nospec: using ORI speculation barrier
[ 711.030889] kernel tried to execute exec-protected page (c00000000a70fc80) - exploit attempt? (uid: 0)
[ 711.030902] BUG: Unable to handle kernel instruction fetch
[ 711.030905] Faulting instruction address: 0xc00000000a70fc80
[ 711.030909] Thread overran stack, or stack corrupted
[ 711.030913] Oops: Kernel access of bad area, sig: 11 [#1]
[ 711.030917] LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA pSeries
[ 711.030924] Modules linked in: dm_mod nft_ct nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set rfkill nf_tables bonding libcrc32c nfnetlink sunrpc pseries_rng xts vmx_crypto sch_fq_codel ext4 mbcache jbd2 sd_mod t10_pi sg ibmvscsi ibmveth scsi_transport_srp fuse
[ 711.030960] CPU: 31 PID: 165 Comm: migration/31 Not tainted 5.17.0-ge8833c5edc59 #1
[ 711.030965] Stopper: multi_cpu_stop+0x0/0x230 <- stop_machine_cpuslocked+0x188/0x1e0
[ 711.030977] NIP: c00000000a70fc80 LR: c00000000a70fc80 CTR: c000000000293f90
[ 711.030981] REGS: c00000000a70f9a0 TRAP: 0400 Not tainted (5.17.0-ge8833c5edc59)
[ 711.030986] MSR: 800000001280b033 <SF,VEC,VSX,EE,FP,ME,IR,DR,RI,LE> CR: 48002822 XER: 00000000
[ 711.031001] CFAR: c000000000216628 IRQMASK: 0
[ 711.031001] GPR00: c00000000a70fc80 c00000000a70fc40 c000000002a1fe00 0000000000c57415
[ 711.031001] GPR04: 0000000000000000 c000000efa36ab80 c000000efa36ab70 c00000000001e688
[ 711.031001] GPR08: 0000000000000000 c000000efa3ef480 0000000000000000 c000000efa3ee600
[ 711.031001] GPR12: 0000000000000000 c000000effbe5a80 c00000000018fc98 c0000000072a5f80
[ 711.031001] GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[ 711.031001] GPR20: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[ 711.031001] GPR24: 0000000000000001 0000000000000002 0000000000000003 c000000002a62138
[ 711.031001] GPR28: c00000024224fb08 0000000000000001 c00000024224fb2c 0000000000000001
[ 711.031054] NIP [c00000000a70fc80] 0xc00000000a70fc80
[ 711.031058] LR [c00000000a70fc80] 0xc00000000a70fc80
[ 711.031062] Call Trace:
[ 711.031065] [c00000000a70fc40] [c00000000a70fc80] 0xc00000000a70fc80 (unreliable)
[ 711.031071] [c00000000a70fcb0] [c000000000293ce4] cpu_stopper_thread+0xe4/0x240
[ 711.031077] [c00000000a70fd60] [0000000119a59724] 0x119a59724
[ 711.031083] BUG: Unable to handle kernel data access on read at 0xc0000014ffffc000
[ 711.031088] Faulting instruction address: 0xc00000000001ccfc
[ 711.031091] Thread overran stack, or stack corrupted
[ 711.031093] Oops: Kernel access of bad area, sig: 11 [#2]
[ 711.031097] LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA pSeries
[ 711.031101] Modules linked in: dm_mod nft_ct nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set rfkill nf_tables bonding libcrc32c nfnetlink sunrpc pseries_rng xts vmx_crypto sch_fq_codel ext4 mbcache jbd2 sd_mod t10_pi sg ibmvscsi ibmveth scsi_transport_srp fuse
[ 711.031128] CPU: 31 PID: 165 Comm: Not tainted 5.17.0-ge8833c5edc59 #1
[ 711.031134] BUG: Unable to handle kernel data access at 0xc10000000214ab60
[ 711.031138] Faulting instruction address: 0xc000000000293e70
[ 711.031141] Thread overran stack, or stack corrupted
[ 711.031144] Oops: Kernel access of bad area, sig: 11 [#3]
………..
………..

In another instance I saw the following crash in ibmveth:

[ 714.823524] count-cache-flush: hardware flush enabled.
[ 714.823528] link-stack-flush: software flush enabled.
[ 714.828529] barrier-nospec: using ORI speculation barrier
[ 715.181552] ------------[ cut here ]------------
[ 715.181558] kernel BUG at drivers/net/ethernet/ibm/ibmveth.c:402!
[ 715.181563] Oops: Exception in kernel mode, sig: 5 [#1]
[ 715.181568] LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA pSeries
[ 715.181572] Modules linked in: dm_mod nft_ct nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set rfkill nf_tables bonding libcrc32c nfnetlink sunrpc pseries_rng xts vmx_crypto sch_fq_codel ext4 mbcache jbd2 sd_mod t10_pi sg ibmvscsi ibmveth scsi_transport_srp fuse
[ 715.181604] CPU: 0 PID: 12 Comm: migration/0 Not tainted 5.17.0-ge8833c5edc59 #1
[ 715.181609] Stopper: multi_cpu_stop+0x0/0x230 <- stop_machine_cpuslocked+0x188/0x1e0
[ 715.181620] NIP: c008000000a91fdc LR: c000000000aca5d4 CTR: c008000000a91e48
[ 715.181624] REGS: c00000000772f300 TRAP: 0700 Not tainted (5.17.0-ge8833c5edc59)
[ 715.181628] MSR: 8000000000029033 <SF,EE,ME,IR,DR,RI,LE> CR: 42004422 XER: 00000000
[ 715.181640] CFAR: c008000000a91f14 IRQMASK: 0
[ 715.181640] GPR00: c000000000aca5d4 c00000000772f5a0 c008000000ac8000 c00000003a4c0a10
[ 715.181640] GPR04: 0000000000000010 000000002d890000 000000012d890000 0000000000000001
[ 715.181640] GPR08: c00000003a4c0a90 c00000005f4135a4 0000000000000000 c008000000a94858
[ 715.181640] GPR12: 0000000000004000 c000000002d20000 c00000000018fc98 c00000003a4c0a10
[ 715.181640] GPR16: 0000000000000101 0000000000000000 00000000000086dd 0000000000000004
[ 715.181640] GPR20: 000000000000dd86 0000000000000000 0000000000000080 000000000000003c
[ 715.181640] GPR24: 000000000000003c 0000000000000080 c00000003a4c0a00 0000000000000010
[ 715.181640] GPR28: 000000000000003c 0000000000000000 0000000000000000 c00000003a4c0000
[ 715.181695] NIP [c008000000a91fdc] ibmveth_poll+0x194/0x860 [ibmveth]
[ 715.181703] LR [c000000000aca5d4] __napi_poll+0x64/0x300
[ 715.181709] Call Trace:
[ 715.181711] [c00000000772f5a0] [c00000000772f5e0] 0xc00000000772f5e0 (unreliable)
[ 715.181718] [c00000000772f6a0] [c000000000aca5d4] __napi_poll+0x64/0x300
[ 715.181723] [c00000000772f720] [c000000000acadfc] net_rx_action+0x33c/0x3f0
[ 715.181729] [c00000000772f7e0] [c000000000d21a9c] __do_softirq+0x15c/0x3d0
[ 715.181737] [c00000000772f8d0] [c00000000015ecf8] irq_exit+0x178/0x1c0
[ 715.181743] [c00000000772f900] [c0000000000168fc] do_IRQ+0xfc/0x280
[ 715.181749] [c00000000772f930] [c0000000000090e8] hardware_interrupt_common_virt+0x218/0x220
[ 715.181757] --- interrupt: 500 at stop_machine_yield+0x8/0x10
[ 715.181762] NIP: c000000000293f88 LR: c0000000002940d8 CTR: c000000000293f90
[ 715.181766] REGS: c00000000772f9a0 TRAP: 0500 Not tainted (5.17.0-ge8833c5edc59)
[ 715.181770] MSR: 800000000280b033 <SF,VEC,VSX,EE,FP,ME,IR,DR,RI,LE> CR: 48004422 XER: 00000000
[ 715.181783] CFAR: 0000000000000000 IRQMASK: 0
[ 715.181783] GPR00: c0000000002940fc c00000000772fc40 c000000002a1fe00 c000000002a62138
[ 715.181783] GPR04: 0000000000000000 c000000ef900ab80 c000000ef900ab70 c00000000001e688
[ 715.181783] GPR08: 0000000000000000 c000000ef908f480 0000000000000000 000000000098967f
[ 715.181783] GPR12: 0000000000000000 c000000002d20000 c00000000018fc98 c0000000072a0f80
[ 715.181783] GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[ 715.181783] GPR20: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[ 715.181783] GPR24: 0000000000000001 0000000000000002 0000000000000003 c000000002a62138
[ 715.181783] GPR28: c00000024119faf8 0000000000000001 c00000024119fb1c 0000000000000001
[ 715.181836] NIP [c000000000293f88] stop_machine_yield+0x8/0x10
[ 715.181841] LR [c0000000002940d8] multi_cpu_stop+0x148/0x230
[ 715.181845] --- interrupt: 500
[ 715.181847] [c00000000772fc40] [c0000000002940fc] multi_cpu_stop+0x16c/0x230 (unreliable)
[ 715.181854] [c00000000772fcb0] [c000000000293ce4] cpu_stopper_thread+0xe4/0x240
[ 715.181859] [c00000000772fd60] [c000000000196114] smpboot_thread_fn+0x1e4/0x250
[ 715.181866] [c00000000772fdc0] [c00000000018fdb4] kthread+0x124/0x130
[ 715.181871] [c00000000772fe10] [c00000000000cf04] ret_from_kernel_thread+0x5c/0x64
[ 715.181877] Instruction dump:
[ 715.181880] 7ce89850 7b980020 7f9707b4 78e70fe0 0b070000 79083e24 78c50020 7d0f4214
[ 715.181890] 80e801b8 7ce72850 78e70fe0 68e70001 <0b070000> 2e2a0000 e94801e8 78c61f48
[ 715.181901] ---[ end trace 0000000000000000 ]---

The kernel eventually panics. I have not been able to reliably recreate these crashes. I have attached the relevant dmesg and crash logs from both instances (merge-crash-1.txt & merge-crash-2.txt).

- Sachin
merge-crash-1.txt.gz
Description: GNU Zip compressed data
merge-crash-2.txt.gz
Description: GNU Zip compressed data