Re: Machine Check in P2010(e500v2)

Joakim Tjernlund Thu, 07 Sep 2017 01:42:08 -0700

On Thu, 2017-09-07 at 00:50 +0200, Joakim Tjernlund wrote:
> On Wed, 2017-09-06 at 21:13 +0000, Leo Li wrote:
> > > -----Original Message-----
> > > From: Joakim Tjernlund [mailto:[email protected]]
> > > Sent: Wednesday, September 06, 2017 3:54 PM
> > > To: [email protected]; Leo Li <[email protected]>; York Sun
> > > <[email protected]>
> > > Subject: Re: Machine Check in P2010(e500v2)
> > > 
> > > On Wed, 2017-09-06 at 20:28 +0000, Leo Li wrote:
> > > > > -----Original Message-----
> > > > > From: Joakim Tjernlund [mailto:[email protected]]
> > > > > Sent: Wednesday, September 06, 2017 3:17 PM
> > > > > To: [email protected]; Leo Li <[email protected]>; York
> > > > > Sun <[email protected]>
> > > > > Subject: Re: Machine Check in P2010(e500v2)
> > > > > 
> > > > > On Wed, 2017-09-06 at 19:31 +0000, Leo Li wrote:
> > > > > > > -----Original Message-----
> > > > > > > From: York Sun
> > > > > > > Sent: Wednesday, September 06, 2017 10:38 AM
> > > > > > > To: Joakim Tjernlund <[email protected]>; linuxppc-
> > > > > > > [email protected]; Leo Li <[email protected]>
> > > > > > > Subject: Re: Machine Check in P2010(e500v2)
> > > > > > > 
> > > > > > > Scott is no longer with Freescale/NXP. Adding Leo.
> > > > > > > 
> > > > > > > On 09/05/2017 01:40 AM, Joakim Tjernlund wrote:
> > > > > > > > So after some debugging I found this bug:
> > > > > > > > @@ -996,7 +998,7 @@ int fsl_pci_mcheck_exception(struct pt_regs *regs)
> > > > > > > >          if (is_in_pci_mem_space(addr)) {
> > > > > > > >                  if (user_mode(regs)) {
> > > > > > > >                          pagefault_disable();
> > > > > > > > -                       ret = get_user(regs->nip, &inst);
> > > > > > > > +                       ret = get_user(inst, (__u32 __user *)regs->nip);
> > > > > > > >                          pagefault_enable();
> > > > > > > >                  } else {
> > > > > > > >                          ret = probe_kernel_address(regs->nip, inst);
> > > > > > > > 
> > > > > > > > However, the kernel still locked up after fixing that.
> > > > > > > > Now I wonder why this fixup is there in the first place? The
> > > > > > > > routine will not really fixup the insn, just return 0xffffffff
> > > > > > > > for the failing read and then advance the process NIP.
> > > > > > 
> > > > > > You are right.  The code here only gives 0xffffffff to the load
> > > > > > instruction and continues with the next instruction when the load
> > > > > > instruction is causing the machine check.  This will prevent a system
> > > > > > lockup when reading from a PCI/RapidIO device whose link is down.
> > > > > > 
> > > > > > I don't know what the actual problem is in your case.  Maybe it is a
> > > > > > write instruction instead of a read?  Or the code is in an infinite
> > > > > > loop waiting for a valid read result?  Are you able to do some further
> > > > > > debugging with the NIP correctly printed?
> > > > > > 
> > > > > 
> > > > > According to the MC it is a read, and the NIP also leads to a read in
> > > > > the program.
> > > > > ATM, I have disabled the fixup but I will enable that again.
> > > > > Question: is it safe to add a small printk when this MC happens (after
> > > > > fixing up)? I need to see that it has happened, as the error is
> > > > > somewhat random.
> > > > 
> > > > I think it is safe to add printk as the current machine check handlers
> > > > are also using printk.
> > > 
> > > I hope so, but if the fixup fires there is no printk at all, so I was a
> > > bit unsure.
> > > I don't like this fixup though. Is there not a better way than faking a
> > > read to user space (or kernel, for that matter)?
> > 
> > I don't have a better idea.  Without the fixup, the offending load 
> > instruction will never finish if there is anything wrong with the backing 
> > device and freeze the whole system.  Do you have any suggestion in mind?
> > 
> 
> But it never finishes the load, it just fakes a load of 0xffffffff. For user
> space I would rather have it signal a SIGBUS, but that does not seem to work
> either, at least not for us, but that could be a bug in the general MC code
> maybe.
> This fixup might be valid for the kernel only, as it has never worked for user
> space due to the bug I found.
> 
> Where can I read about this erratum?


I have looked high and low and cannot find an erratum that maps to this fixup.
The closest I get is A-005125, which seems to have a different workaround. I
cannot find any evidence that this workaround has been applied in Linux, can you?
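
For anyone following along, the behaviour discussed in this thread (hand
0xffffffff back to the failing load, then step NIP past it) can be sketched in
plain user-space C. This is only an illustration under stated assumptions, not
the code in fsl_pci.c: struct fake_regs and fake_load_fixup() are made-up
names, only three D-form loads are decoded, and nothing here touches real
hardware.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the kernel's struct pt_regs on a 32-bit core. */
struct fake_regs {
    uint32_t gpr[32];   /* general purpose registers r0..r31 */
    uint32_t nip;       /* next instruction pointer */
};

/* Primary opcodes (top 6 bits) of three PowerPC D-form load instructions. */
enum { OP_LWZ = 32, OP_LBZ = 34, OP_LHZ = 40 };

/*
 * Pretend the load at regs->nip completed with all-ones data.
 * Returns 1 if the instruction was a recognised load and was "fixed up",
 * 0 if it was left alone (for example a store).
 */
int fake_load_fixup(struct fake_regs *regs, uint32_t inst)
{
    uint32_t opcode, rd;

    assert(regs != NULL);

    opcode = inst >> 26;          /* bits 0..5: primary opcode */
    rd = (inst >> 21) & 0x1f;     /* bits 6..10: destination register */

    switch (opcode) {
    case OP_LWZ:
        regs->gpr[rd] = 0xffffffffu;  /* word load */
        break;
    case OP_LBZ:
        regs->gpr[rd] = 0xffu;        /* byte load, zero extended */
        break;
    case OP_LHZ:
        regs->gpr[rd] = 0xffffu;      /* halfword load, zero extended */
        break;
    default:
        return 0;                     /* not a load this sketch knows about */
    }

    regs->nip += 4;                   /* skip the 4-byte instruction */
    return 1;
}
```

Note that the caller sees the load "succeed" with all-ones data and never
learns that the access failed, which is the objection raised above.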

 Jocke
