On Mon, Nov 07, 2005 at 08:16:18AM -0200, Marcelo Tosatti wrote:
> Joakim!
>
> On Mon, Nov 07, 2005 at 03:32:52PM +0100, Joakim Tjernlund wrote:
> > Hi Marcelo
> >
> > [SNIP]
> > > The root of the problem are the changes against the 8xx TLB
> > > handlers introduced during v2.6. What happens is the TLBMiss
> > > handlers load the zeroed pte into the TLB, causing the TLBError
> > > handler to be invoked (thats two TLB faults per pagefault), which
> > > then jumps to the generic MM code to setup the pte.
> > >
> > > The bug is that the zeroed TLB is not invalidated (the same reason
> > > for the "dcbst" misbehaviour), resulting in infinite TLBError faults.
> > >
> > > Dan, I wonder why we just don't go back to v2.4 behaviour.
> >
> > This is one reason why it is the way it is:
> > http://ozlabs.org/pipermail/linuxppc-embedded/2005-January/016382.html
> > This details are little fuzzy ATM, but I think the reason for the
> > current impl. was only that it was less intrusive to impl.
>
> Ah, I see. I wonder if the bug is processor specific: we don't have such
> changes in our v2.4 tree and never experienced such problem.
>
> It should be pretty easy to hit it right? (instruction pagefaults should
> fail).
>
> Grigori, Tom, can you enlight us about the issue on the URL above. How
> can it be triggered?

So after looking at the code in 2.6.14 and current git, I think the
above URL isn't relevant, unless there was a change I missed (which is
entirely possible) that reverted the patch there and fixed that issue in
a different manner. But since I didn't figure that out until I had
finished researching it again:

Switching hats for a minute, this came from a bug a customer of
MontaVista found, so I can't give out the testcase :(

To repeat what Joakim said back then:
"I think I have figured this out. The first TLB misses that happen at
app startup is Data TLB misses. These will then hit the NULL L1 entry
and end up in do_page_fault() which will populate the L1 entry. But when
you have a very large app that spans more than one L1 entry (16 MB I
think) it may happen that you will have I-TLB Miss first one of the L1
entrys which will make the I-TLB handler bail out to do_page_fault() and
the app craches(SEGV)."

Looking at the patch again, what I don't see is why I talk about fudging
I-TLB Miss as being at 0x400 when it's I-TLB Error that we fudge as
being there. I then get hung up on the slight difference between the two
("This is because we check bit 4 of SRR1 in both cases, but in the case
of an I-TLB Miss, this bit is always set, and it only indicates a
protection fault on an I-TLB Error."). So instead of having 0x1300 jump
to the handler at 0x400, we treat it like a regular exception so that we
know where we came from. Perhaps we missed fixing a case somewhere?

-- 
Tom Rini
http://gate.crashing.org/~trini/
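
For reference, the SRR1 bit 4 distinction quoted above could be sketched
in C roughly as follows. This is only an illustration of the rule as
stated in the quote, not the actual kernel code: the helper names
(ppc_bit, is_protection_fault) and the macro names are hypothetical, and
0x1100 as the 8xx I-TLB Miss vector is my assumption, since only 0x400
and 0x1300 are named in this thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Vector numbers. 0x0400 and 0x1300 come from the thread; 0x1100 is an
 * assumption about where the 8xx I-TLB Miss handler lives. */
#define VEC_INSN_ACCESS 0x0400u /* generic instruction access handler */
#define VEC_ITLB_MISS   0x1100u /* assumed 8xx I-TLB Miss vector */
#define VEC_ITLB_ERROR  0x1300u /* 8xx I-TLB Error vector */

/* PowerPC numbers bits from the most significant end, so "bit 4" of a
 * 32-bit register is the mask 0x08000000. (Hypothetical helper.) */
static uint32_t ppc_bit(unsigned ibm_bit)
{
    assert(ibm_bit < 32);
    return UINT32_C(1) << (31u - ibm_bit);
}

/* Hypothetical helper: SRR1 bit 4 only means "protection fault" when we
 * arrived via I-TLB Error. On an I-TLB Miss the bit is always set, so it
 * carries no information and must be ignored. */
static bool is_protection_fault(uint32_t vector, uint32_t srr1)
{
    if (vector != VEC_ITLB_ERROR)
        return false;
    return (srr1 & ppc_bit(4)) != 0;
}
```

The point of passing the vector in is the same as in the last paragraph
of the mail: once 0x1300 has simply jumped to the 0x400 handler, the
information needed to interpret bit 4 correctly is gone, so the caller
has to remember where it came from.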