Kumar,

To follow up on our postings from late last week (to which I was expecting a response from you, but never got one)...
-----

We (well, mostly one very bright and very persistent engineer) have found the origin of the kernel TLB corruption. We tracked the problem down to a programming bug in the DataStorage exception handler of our kernel (2.6.23). We have looked at newer kernels and noticed that this piece of processing has changed, but let me explain what happened, and the conditions that caused the problem on our MPC8572E (running SMP)...

If you follow the logic in this version of the kernel, it reads SPRN_DEAR into register R10, does some operations (including a tlbsx operation, which uses R10), and then attempts to update the associated PTE.

Well, if you have REALLY bad luck, sometime between the time you took this exception and the time you try to update the PTE for this page, the other core has decided to invalidate this page's PTE. The good part is that the kernel recognizes this unlucky case. Unfortunately, the handling of this 'bad luck' case contains a kernel bug. The kernel uses R10 for some processing (it puts the physical address associated with this virtual page there) and then branches back up 'above' the tlbsx operation to try again ...without restoring R10 to the SPRN_DEAR value required by the tlbsx operation...

This means that even though the kernel recognized this exceptional case, it NEVER did the right thing. Instead, the kernel would (attempt to) modify the unlucky TLB entry whose virtual address corresponds to the physical address of the original DataStorage exception.

The only reason we caught this is that we also had a second piece of 'bad luck': that physical address mapped to the virtual address of the kernel (0xC0000000). So when the handler loops back to try again, it gets the kernel page(s) from the tlbsx operation and modifies the permissions on the kernel pages, which causes an InstructionStorage exception (forever).
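To make the sequence of events concrete, here is a minimal Python sketch of the retry path. It is a toy model under stated assumptions, not the kernel's code: the function names, the two-entry TLB, the user page address, the 256 MB size of the kernel mapping, and the scratch value left in r10 are all made up for illustration. The scratch value is simply chosen so that it falls inside the range covered by the kernel entry at 0xC0000000.

```python
# Toy model only: every name and constant here is hypothetical, and the
# control flow is a simplification of the handler described above.

KERNEL_PERMS = {"SR", "SW", "SX"}            # supervisor read/write/execute
USER_DATA_PERMS = {"UR", "UW", "SR", "SW"}   # data page: no execute rights


def make_tlb():
    """Two entries: the kernel mapping (assumed 256 MB) and one 4 KB user page."""
    return [
        {"name": "kernel", "base": 0xC0000000, "size": 0x10000000,
         "perms": set(KERNEL_PERMS)},
        {"name": "user", "base": 0x10001000, "size": 0x1000,
         "perms": set()},
    ]


def tlbsx(tlb, ea):
    """Return the entry whose virtual range covers ea, or None."""
    for entry in tlb:
        if entry["base"] <= ea < entry["base"] + entry["size"]:
            return entry
    return None


def handle_fault(tlb, dear, scratch, pte_present, restore_r10):
    """Run the retry loop; pte_present holds one bool per pass.

    Returns the name of the TLB entry that was finally modified, or None.
    """
    r10 = dear                       # models: mfspr r10, SPRN_DEAR
    for present in pte_present:
        entry = tlbsx(tlb, r10)      # models label 5: the search uses r10
        if present and entry is not None:
            entry["perms"] = set(USER_DATA_PERMS)
            return entry["name"]
        r10 = scratch                # PTE vanished: r10 reused as scratch
        if restore_r10:
            r10 = dear               # the one-line fix
        # models: b 5b (try again)
    return None


if __name__ == "__main__":
    for restore in (False, True):
        tlb = make_tlb()
        hit = handle_fault(tlb, 0x10001234, 0xC0123000, [False, True], restore)
        print("restore_r10 =", restore, "-> modified entry:", hit,
              "| kernel perms:", sorted(tlb[0]["perms"]))
```

With `restore_r10=False` the second pass searches with the scratch value, lands on the kernel entry, and strips its SX bit, which is the state shown in the TLB dump quoted further down. With `restore_r10=True` the second pass searches with the faulting address again and only the user page is touched.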
We fixed this in our kernel by restoring R10 to the SPRN_DEAR value just before it loops back, something like this:

================================
....
        mtspr   SPRN_MAS1, r13
        tlbwe

        /* because we did NOT find in PTE */
        /* r10 was changed - so we need   */
        /* to re-load it here to work     */
        mfspr   r10, SPRN_DEAR  /* restore the faulting address */
        b       5b              /* Try again */
....
================================

That's the short and long of it... and 4 weeks of very stressful problems...

I am wondering why nobody has found this problem before - are we the first to be this unlucky? I am not sure that is a good thing!

Comments? Suggestions? What else should I be doing with this information?

Tom Morrison
Principal Software Engineer
EMPIRIX
20 Crosby Drive - Bedford, MA 01730
p: 781.266.3567 f: 781.266.3670
email: tmorri...@empirix.com
www.empirix.com

>> -----Original Message-----
>> From: Morrison, Tom
>> Sent: Thursday, May 21, 2009 11:24 AM
>> To: Morrison, Tom; Kumar Gala
>> Cc: linuxppc-dev@ozlabs.org; Young, Andrew; Brown, Jeff; Geary Sean-R60898
>> Subject: RE: How to debug a hung multi-core system....
>>
>> Just had a little conference with several co-workers...to go over results
>>
>> We think that LT0 (the one that maps the kernel) has been corrupted:
>>
>> Entry  EPN       RPN       TID  TMASK  WIMGE  TSIZ  U0:3  X0:1
>> ---------------------------------------------------------------
>> LT0    C0000000  00000000  00   0FF    04     9     0     0
>>
>> PID  TS  PROT  SHEN  UR  UW  UX  SR  SW  SX  TIDZ  VAL
>> ---------------------------------------------------------------
>> 0    0   P     P     E   E   D   E   E   D   D     V
>>
>> Is absolutely wrong - this is TLB for the kernel - and as you can see
>> ...it does NOT have execution privileges (and in fact the user space
>> HAS executive privileges for this area (complete opposite of what it
>> should be)...
>>
>> This is why it is stuck AT that instruction (can't even single step
>> from that location)..
>>
>> (one of) The first problem(s) is how can/when did this TLB get corrupted!
>>
>> Tom
_______________________________________________
Linuxppc-dev mailing list
Linuxppc-dev@ozlabs.org
https://ozlabs.org/mailman/listinfo/linuxppc-dev