[powerpc/merge] Possible stack corruption while running selftests

Sachin Sant Thu, 24 Mar 2022 00:00:07 -0700

I am seeing crashes that look random (at least to me) while running the
powerpc selftests on a P10 LPAR with powerpc/merge branch code. The
mitigation-patching.sh test was running in both instances.


In the latest instance it looks like possible stack corruption:

[  711.005150] count-cache-flush: hardware flush enabled.
[  711.005153] link-stack-flush: software flush enabled.
[  711.015306] barrier-nospec: using ORI speculation barrier
[  711.030889] kernel tried to execute exec-protected page (c00000000a70fc80) - 
exploit attempt? (uid: 0)
[  711.030902] BUG: Unable to handle kernel instruction fetch
[  711.030905] Faulting instruction address: 0xc00000000a70fc80
[  711.030909] Thread overran stack, or stack corrupted
[  711.030913] Oops: Kernel access of bad area, sig: 11 [#1]
[  711.030917] LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA pSeries
[  711.030924] Modules linked in: dm_mod nft_ct nf_conntrack nf_defrag_ipv6 
nf_defrag_ipv4 ip_set rfkill nf_tables bonding libcrc32c nfnetlink sunrpc 
pseries_rng xts vmx_crypto sch_fq_codel ext4 mbcache jbd2 sd_mod t10_pi sg 
ibmvscsi ibmveth scsi_transport_srp fuse
[  711.030960] CPU: 31 PID: 165 Comm: migration/31 Not tainted 
5.17.0-ge8833c5edc59 #1
[  711.030965] Stopper: multi_cpu_stop+0x0/0x230 <- 
stop_machine_cpuslocked+0x188/0x1e0
[  711.030977] NIP:  c00000000a70fc80 LR: c00000000a70fc80 CTR: c000000000293f90
[  711.030981] REGS: c00000000a70f9a0 TRAP: 0400   Not tainted  
(5.17.0-ge8833c5edc59)
[  711.030986] MSR:  800000001280b033 <SF,VEC,VSX,EE,FP,ME,IR,DR,RI,LE>  CR: 
48002822  XER: 00000000
[  711.031001] CFAR: c000000000216628 IRQMASK: 0
[  711.031001] GPR00: c00000000a70fc80 c00000000a70fc40 c000000002a1fe00 
0000000000c57415
[  711.031001] GPR04: 0000000000000000 c000000efa36ab80 c000000efa36ab70 
c00000000001e688
[  711.031001] GPR08: 0000000000000000 c000000efa3ef480 0000000000000000 
c000000efa3ee600
[  711.031001] GPR12: 0000000000000000 c000000effbe5a80 c00000000018fc98 
c0000000072a5f80
[  711.031001] GPR16: 0000000000000000 0000000000000000 0000000000000000 
0000000000000000
[  711.031001] GPR20: 0000000000000000 0000000000000000 0000000000000000 
0000000000000000
[  711.031001] GPR24: 0000000000000001 0000000000000002 0000000000000003 
c000000002a62138
[  711.031001] GPR28: c00000024224fb08 0000000000000001 c00000024224fb2c 
0000000000000001
[  711.031054] NIP [c00000000a70fc80] 0xc00000000a70fc80
[  711.031058] LR [c00000000a70fc80] 0xc00000000a70fc80
[  711.031062] Call Trace:
[  711.031065] [c00000000a70fc40] [c00000000a70fc80] 0xc00000000a70fc80 
(unreliable)
[  711.031071] [c00000000a70fcb0] [c000000000293ce4] 
cpu_stopper_thread+0xe4/0x240
[  711.031077] [c00000000a70fd60] [0000000119a59724] 0x119a59724
[  711.031083] BUG: Unable to handle kernel data access on read at 
0xc0000014ffffc000
[  711.031088] Faulting instruction address: 0xc00000000001ccfc
[  711.031091] Thread overran stack, or stack corrupted
[  711.031093] Oops: Kernel access of bad area, sig: 11 [#2]
[  711.031097] LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA pSeries
[  711.031101] Modules linked in: dm_mod nft_ct nf_conntrack nf_defrag_ipv6 
nf_defrag_ipv4 ip_set rfkill nf_tables bonding libcrc32c nfnetlink sunrpc 
pseries_rng xts vmx_crypto sch_fq_codel ext4 mbcache jbd2 sd_mod t10_pi sg 
ibmvscsi ibmveth scsi_transport_srp fuse
[  711.031128] CPU: 31 PID: 165 Comm:  Not tainted 5.17.0-ge8833c5edc59 #1
[  711.031134] BUG: Unable to handle kernel data access at 0xc10000000214ab60
[  711.031138] Faulting instruction address: 0xc000000000293e70
[  711.031141] Thread overran stack, or stack corrupted
[  711.031144] Oops: Kernel access of bad area, sig: 11 [#3]
………..
………..
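
One reason the first oops points at stack corruption: NIP and LR both hold
c00000000a70fc80, which is just above the stack pointer in GPR01
(c00000000a70fc40), so the CPU appears to have been fetching instructions
from the thread's own stack. A rough sketch of that check in Python follows.
The helper names are hypothetical, and the 16K THREAD_SIZE is an assumption
about this kernel's configuration, not something stated in the log.

```python
import re

# Assumption: 16K kernel stacks, aligned to their size. Adjust if the
# kernel under test was built with a different THREAD_SHIFT.
THREAD_SIZE = 16 * 1024

# Two lines copied from the first oops above.
SAMPLE = """\
[  711.030977] NIP:  c00000000a70fc80 LR: c00000000a70fc80 CTR: c000000000293f90
[  711.031001] GPR00: c00000000a70fc80 c00000000a70fc40 c000000002a1fe00
"""


def parse_oops_regs(text):
    """Return (nip, r1) from the first register dump found in text."""
    nip = re.search(r"NIP:\s+([0-9a-f]{16})", text)
    # GPR00 and GPR01 are printed on the same line; GPR01 is the stack pointer.
    gprs = re.search(r"GPR00:\s+([0-9a-f]{16})\s+([0-9a-f]{16})", text)
    if nip is None or gprs is None:
        raise ValueError("no register dump found")
    return int(nip.group(1), 16), int(gprs.group(2), 16)


def nip_on_stack(nip, r1, thread_size=THREAD_SIZE):
    """True if nip falls inside the stack region that contains r1."""
    base = r1 & ~(thread_size - 1)
    return base <= nip < base + thread_size


if __name__ == "__main__":
    nip, r1 = parse_oops_regs(SAMPLE)
    print(hex(nip), hex(r1), nip_on_stack(nip, r1))
```

The same check on the interrupted context in the second crash (NIP in
stop_machine_yield) would report that NIP is not on the stack, which fits
that crash looking different from this one.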

In another instance I saw the following crash in ibmveth:

[  714.823524] count-cache-flush: hardware flush enabled.
[  714.823528] link-stack-flush: software flush enabled.
[  714.828529] barrier-nospec: using ORI speculation barrier
[  715.181552] ------------[ cut here ]------------
[  715.181558] kernel BUG at drivers/net/ethernet/ibm/ibmveth.c:402!
[  715.181563] Oops: Exception in kernel mode, sig: 5 [#1]
[  715.181568] LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA pSeries
[  715.181572] Modules linked in: dm_mod nft_ct nf_conntrack nf_defrag_ipv6 
nf_defrag_ipv4 ip_set rfkill nf_tables bonding libcrc32c nfnetlink sunrpc 
pseries_rng xts vmx_crypto sch_fq_codel ext4 mbcache jbd2 sd_mod t10_pi sg 
ibmvscsi ibmveth scsi_transport_srp fuse
[  715.181604] CPU: 0 PID: 12 Comm: migration/0 Not tainted 
5.17.0-ge8833c5edc59 #1
[  715.181609] Stopper: multi_cpu_stop+0x0/0x230 <- 
stop_machine_cpuslocked+0x188/0x1e0
[  715.181620] NIP:  c008000000a91fdc LR: c000000000aca5d4 CTR: c008000000a91e48
[  715.181624] REGS: c00000000772f300 TRAP: 0700   Not tainted  
(5.17.0-ge8833c5edc59)
[  715.181628] MSR:  8000000000029033 <SF,EE,ME,IR,DR,RI,LE>  CR: 42004422  
XER: 00000000
[  715.181640] CFAR: c008000000a91f14 IRQMASK: 0
[  715.181640] GPR00: c000000000aca5d4 c00000000772f5a0 c008000000ac8000 
c00000003a4c0a10
[  715.181640] GPR04: 0000000000000010 000000002d890000 000000012d890000 
0000000000000001
[  715.181640] GPR08: c00000003a4c0a90 c00000005f4135a4 0000000000000000 
c008000000a94858
[  715.181640] GPR12: 0000000000004000 c000000002d20000 c00000000018fc98 
c00000003a4c0a10
[  715.181640] GPR16: 0000000000000101 0000000000000000 00000000000086dd 
0000000000000004
[  715.181640] GPR20: 000000000000dd86 0000000000000000 0000000000000080 
000000000000003c
[  715.181640] GPR24: 000000000000003c 0000000000000080 c00000003a4c0a00 
0000000000000010
[  715.181640] GPR28: 000000000000003c 0000000000000000 0000000000000000 
c00000003a4c0000
[  715.181695] NIP [c008000000a91fdc] ibmveth_poll+0x194/0x860 [ibmveth]
[  715.181703] LR [c000000000aca5d4] __napi_poll+0x64/0x300
[  715.181709] Call Trace:
[  715.181711] [c00000000772f5a0] [c00000000772f5e0] 0xc00000000772f5e0 
(unreliable)
[  715.181718] [c00000000772f6a0] [c000000000aca5d4] __napi_poll+0x64/0x300
[  715.181723] [c00000000772f720] [c000000000acadfc] net_rx_action+0x33c/0x3f0
[  715.181729] [c00000000772f7e0] [c000000000d21a9c] __do_softirq+0x15c/0x3d0
[  715.181737] [c00000000772f8d0] [c00000000015ecf8] irq_exit+0x178/0x1c0
[  715.181743] [c00000000772f900] [c0000000000168fc] do_IRQ+0xfc/0x280
[  715.181749] [c00000000772f930] [c0000000000090e8] 
hardware_interrupt_common_virt+0x218/0x220
[  715.181757] --- interrupt: 500 at stop_machine_yield+0x8/0x10
[  715.181762] NIP:  c000000000293f88 LR: c0000000002940d8 CTR: c000000000293f90
[  715.181766] REGS: c00000000772f9a0 TRAP: 0500   Not tainted  
(5.17.0-ge8833c5edc59)
[  715.181770] MSR:  800000000280b033 <SF,VEC,VSX,EE,FP,ME,IR,DR,RI,LE>  CR: 
48004422  XER: 00000000
[  715.181783] CFAR: 0000000000000000 IRQMASK: 0
[  715.181783] GPR00: c0000000002940fc c00000000772fc40 c000000002a1fe00 
c000000002a62138
[  715.181783] GPR04: 0000000000000000 c000000ef900ab80 c000000ef900ab70 
c00000000001e688
[  715.181783] GPR08: 0000000000000000 c000000ef908f480 0000000000000000 
000000000098967f
[  715.181783] GPR12: 0000000000000000 c000000002d20000 c00000000018fc98 
c0000000072a0f80
[  715.181783] GPR16: 0000000000000000 0000000000000000 0000000000000000 
0000000000000000
[  715.181783] GPR20: 0000000000000000 0000000000000000 0000000000000000 
0000000000000000
[  715.181783] GPR24: 0000000000000001 0000000000000002 0000000000000003 
c000000002a62138
[  715.181783] GPR28: c00000024119faf8 0000000000000001 c00000024119fb1c 
0000000000000001
[  715.181836] NIP [c000000000293f88] stop_machine_yield+0x8/0x10
[  715.181841] LR [c0000000002940d8] multi_cpu_stop+0x148/0x230
[  715.181845] --- interrupt: 500
[  715.181847] [c00000000772fc40] [c0000000002940fc] multi_cpu_stop+0x16c/0x230 
(unreliable)
[  715.181854] [c00000000772fcb0] [c000000000293ce4] 
cpu_stopper_thread+0xe4/0x240
[  715.181859] [c00000000772fd60] [c000000000196114] 
smpboot_thread_fn+0x1e4/0x250
[  715.181866] [c00000000772fdc0] [c00000000018fdb4] kthread+0x124/0x130
[  715.181871] [c00000000772fe10] [c00000000000cf04] 
ret_from_kernel_thread+0x5c/0x64
[  715.181877] Instruction dump:
[  715.181880] 7ce89850 7b980020 7f9707b4 78e70fe0 0b070000 79083e24 78c50020 
7d0f4214
[  715.181890] 80e801b8 7ce72850 78e70fe0 68e70001 <0b070000> 2e2a0000 e94801e8 
78c61f48
[  715.181901] ---[ end trace 0000000000000000 ]---

The kernel eventually panics.

I have not been able to reliably recreate these crashes.

I have attached the relevant dmesg and crash logs from both instances
(merge-crash-1.txt and merge-crash-2.txt).

- Sachin

merge-crash-1.txt.gz
Description: GNU Zip compressed data

merge-crash-2.txt.gz
Description: GNU Zip compressed data
