On 02/19/2020 at 4:21 PM Christophe Leroy <christophe.le...@c-s.fr> wrote:
>
> Radu Rendec <radu.ren...@gmail.com> wrote:
>
> >> On 02/19/2020 at 10:11 AM Radu Rendec <radu.ren...@gmail.com> wrote:
> >>> On 02/18/2020 at 1:08 PM Christophe Leroy <christophe.le...@c-s.fr> wrote:
> >>>> On 18/02/2020 at 18:07, Radu Rendec wrote:
> >>>> > The saved NIP seems to be broken inside machine_check_exception() on
> >>>> > MPC8378, running Linux 4.9.191. The value is 0x900 most of the time,
> >>>> > but I have seen other weird values.
> >>>> >
> >>>> > I've been able to track down the entry code to head_32.S (vector
> >>>> > 0x200), but I'm not sure where/how the NIP value (where the exception
> >>>> > occurred) is captured.
> >>>>
> >>>> NIP value is supposed to come from SRR0, loaded in r12 in PROLOG_2 and
> >>>> saved into _NIP(r11) in transfer_to_handler in entry_32.S
> >>>>
> >>>> Can something clobber r12 at some point?
> >>>>
> >>>
> >>> I did something even simpler: I added the following
> >>>
> >>>     lis r12,0x1234
> >>>
> >>> ... right after
> >>>
> >>>     mfspr r12,SPRN_SRR0
> >>>
> >>> ... and now the NIP value I see in the crash dump is 0x12340000. This
> >>> means r12 is not clobbered and most likely the NIP value I normally see
> >>> is the actual SRR0 value.
> >>
> >> I apologize for the noise. I just found out accidentally that the saved
> >> NIP value is correct if interrupts are disabled at the time when the
> >> faulty access that triggers the MCE occurs. This seems to happen
> >> consistently.
> >>
> >> By "interrupts are disabled" I mean local_irq_save/local_irq_restore, so
> >> it's basically enough to wrap ioread32 to get the NIP value right.
> >>
> >> Does this make any sense? Maybe it's not a silicon bug after all, or
> >> maybe it is and I just found a workaround. Could this happen on other
> >> PowerPC CPUs as well?
>
> Interesting.
>
> 0x900 is the address of the timer interrupt.
>
> Would the MCE occur just after the timer interrupt?

I doubt that. I'm using a small test module to artificially trigger the
MCE. Basically it's just this (the full code is in my original post):

    bad_addr_base = ioremap(0xf0000000, 0x100);
    x = ioread32(bad_addr_base);

I find it hard to believe that every time I load the module the lwbrx
instruction that triggers the MCE is executed exactly after the timer
interrupt (or that the timer interrupt always occurs close to the lwbrx
instruction).

> Can you tell how your IO busses are configured, etc.?

Nothing special. The device tree is mostly similar to mpc8379_rdb.dts,
but I can provide the actual dts if you think it's relevant.

> And what's the value of SERSR after the machine check?

I'm assuming you're talking about the IPIC SERSR register. I modified
machine_check_exception and added a call to ipic_get_mcp_status, which
seems to read IPIC_SERSR. The value is 0, both with interrupts enabled
and disabled (which makes sense, since disabling/enabling interrupts is
local to the CPU core).

> Do you use the local bus monitoring driver?

I don't. In fact, I'm not even aware of it. What driver is that?

Best regards,
Radu