Mahesh J Salgaonkar's on June 21, 2019 9:47 pm:
> On 2019-06-21 06:27:15 Fri, Santosh Sivaraj wrote:
>> From: Reza Arbab <ar...@linux.ibm.com>
>>
>> Testing my memcpy_mcsafe() work in progress with an injected UE, I get
>> an error like this immediately after the function returns:
>>
>> BUG: Unable to handle kernel data access at 0x7fff84dec8f8
>> Faulting instruction address: 0xc0080000009c00b0
>> Oops: Kernel access of bad area, sig: 11 [#1]
>> LE PAGE_SIZE=64K MMU=Radix MMU=Hash SMP NR_CPUS=2048 NUMA PowerNV
>> Modules linked in: mce(O+) vmx_crypto crc32c_vpmsum
>> CPU: 0 PID: 1375 Comm: modprobe Tainted: G O 5.1.0-rc6 #267
>> NIP: c0080000009c00b0 LR: c0080000009c00a8 CTR: c000000000095f90
>> REGS: c0000000ee197790 TRAP: 0300 Tainted: G O (5.1.0-rc6)
>> MSR: 900000000280b033 <SF,HV,VEC,VSX,EE,FP,ME,IR,DR,RI,LE> CR: 88002826
>> XER: 00040000
>> CFAR: c000000000095f8c DAR: 00007fff84dec8f8 DSISR: 40000000 IRQMASK: 0
>> GPR00: 000000006c6c6568 c0000000ee197a20 c0080000009c8400 fffffffffffffff2
>> GPR04: c0080000009c02e0 0000000000000006 0000000000000000 c000000003c834c8
>> GPR08: 0080000000000000 776a6681b7fb5100 0000000000000000 c0080000009c01c8
>> GPR12: c000000000095f90 00007fff84debc00 000000004d071440 0000000000000000
>> GPR16: 0000000100000601 c0080000009e0000 c000000000c98dd8 c000000000c98d98
>> GPR20: c000000003bba970 c0080000009c04d0 c0080000009c0618 c0000000001e5820
>> GPR24: 0000000000000000 0000000000000100 0000000000000001 c000000003bba958
>> GPR28: c0080000009c02e8 c0080000009c0318 c0080000009c02e0 0000000000000000
>> NIP [c0080000009c00b0] cause_ue+0xa8/0xe8 [mce]
>> LR [c0080000009c00a8] cause_ue+0xa0/0xe8 [mce]
>>
>> To fix, ensure that r13 is properly restored after an MCE.
>>
>> Signed-off-by: Reza Arbab <ar...@linux.ibm.com>
>> ---
>> arch/powerpc/kernel/exceptions-64s.S | 1 +
>> 1 file changed, 1 insertion(+)
>>
>> diff --git a/arch/powerpc/kernel/exceptions-64s.S
>> b/arch/powerpc/kernel/exceptions-64s.S
>> index 311f1392a2ec..932d8d05892c 100644
>> --- a/arch/powerpc/kernel/exceptions-64s.S
>> +++ b/arch/powerpc/kernel/exceptions-64s.S
>> @@ -265,6 +265,7 @@ ALT_FTR_SECTION_END_IFSET(CPU_FTR_HVMODE)
>> EXC_REAL_END(machine_check, 0x200, 0x100)
>> EXC_VIRT_NONE(0x4200, 0x100)
>> TRAMP_REAL_BEGIN(machine_check_common_early)
>> + SET_SCRATCH0(r13) /* save r13 */
>> EXCEPTION_PROLOG_1(PACA_EXMC, NOTEST, 0x200)
>> /*
>> * Register contents:
>
> We do save r13 before we call machine_check_common_early(). I don't
> think I understood clearly how this change fixes the issue you are
> seeing. What am I missing here ?
>
> Above change will push the paca pointer to scratch0 overwriting the
> original saved r13.
>
> EXC_REAL_BEGIN(machine_check, 0x200, 0x100)
> /* This is moved out of line as it can be patched by FW, but
> * some code path might still want to branch into the original
> * vector
> */
> SET_SCRATCH0(r13) /* save r13 */
> EXCEPTION_PROLOG_0(PACA_EXMC)
> BEGIN_FTR_SECTION
> b machine_check_common_early

Yep, from the stack trace, r13 is corrupted. So r13 must have got
corrupted before the machine check and this just happens to have
corrected it.

How does cause_ue work? It or memcpy_mcsafe must be corrupting r13.

Thanks,
Nick
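
To make the argument above concrete, here is a rough C model of the entry path under discussion. It is only a sketch, not kernel code: the struct, its field names, the function names and the paca address are all hypothetical, and each macro is reduced to the single r13-related step assumed here (SET_SCRATCH0 copies r13 to a scratch register, EXCEPTION_PROLOG_0 points r13 at the paca, EXCEPTION_PROLOG_1 copies scratch0 into the save area). The bad r13 value is GPR13 from the oops.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model state; the field names are illustrative only. */
struct cpu_model {
	uint64_t r13;      /* live r13 register */
	uint64_t scratch0; /* stands in for the scratch SPR */
	uint64_t ex_r13;   /* stands in for the r13 slot of the PACA_EXMC area */
	uint64_t paca;     /* address of this CPU's paca */
};

/* Models SET_SCRATCH0(r13): stash the live r13 in the scratch register. */
static void set_scratch0(struct cpu_model *c) { c->scratch0 = c->r13; }

/* Models only the step of EXCEPTION_PROLOG_0 that loads the paca into r13. */
static void prolog_0(struct cpu_model *c) { c->r13 = c->paca; }

/* Models only the step of EXCEPTION_PROLOG_1 that saves scratch0 as "r13". */
static void prolog_1(struct cpu_model *c) { c->ex_r13 = c->scratch0; }

/* Walk the simplified entry path; return the r13 restored on exit. */
uint64_t r13_after_mce(uint64_t r13_at_entry, uint64_t paca, bool extra_save)
{
	struct cpu_model c = { .r13 = r13_at_entry, .paca = paca };

	set_scratch0(&c);         /* at the 0x200 vector */
	prolog_0(&c);             /* r13 now holds the paca pointer */
	if (extra_save)
		set_scratch0(&c); /* the extra SET_SCRATCH0 from the patch */
	prolog_1(&c);
	c.r13 = c.ex_r13;         /* exception exit restores the saved r13 */
	return c.r13;
}

/* Compare both variants for a corrupted and for a sane r13. */
void self_check(void)
{
	const uint64_t bad_r13 = 0x00007fff84debc00ULL; /* GPR13 in the oops */
	const uint64_t paca = 0xc000000003c80000ULL;    /* made-up address */

	/* Unpatched: whatever r13 held at entry comes back unchanged. */
	assert(r13_after_mce(bad_r13, paca, false) == bad_r13);
	/* Patched: the paca pointer is saved instead, hiding the corruption. */
	assert(r13_after_mce(bad_r13, paca, true) == paca);
	/* With a sane r13 at entry, both variants restore the same value. */
	assert(r13_after_mce(paca, paca, false) == paca);
	assert(r13_after_mce(paca, paca, true) == paca);
}
```

In this model the extra save only changes the outcome when r13 was already wrong on entry, which would be consistent with the corruption happening in cause_ue or memcpy_mcsafe rather than in the machine check path.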