> -----Original Message-----
> From: Tom Rini [mailto:trini at kernel.crashing.org]
> Sent: 07 November 2005 16:52
> To: Marcelo Tosatti
> Cc: Joakim Tjernlund; Pantelis Antoniou; Dan Malek;
> linuxppc-embedded at ozlabs.org; gtolstolytkin at ru.mvista.com
> Subject: Re: [PATCH 2.6.14] mm: 8xx MM fix for
>
> On Mon, Nov 07, 2005 at 08:16:18AM -0200, Marcelo Tosatti wrote:
> > Joakim!
> >
> > On Mon, Nov 07, 2005 at 03:32:52PM +0100, Joakim Tjernlund wrote:
> > > Hi Marcelo
> > >
> > > [SNIP]
> > > > The root of the problem is the changes against the 8xx TLB
> > > > handlers introduced during v2.6. What happens is the TLBMiss
> > > > handlers load the zeroed pte into the TLB, causing the TLBError
> > > > handler to be invoked (that's two TLB faults per pagefault),
> > > > which then jumps to the generic MM code to set up the pte.
> > > >
> > > > The bug is that the zeroed TLB entry is not invalidated (the same
> > > > reason for the "dcbst" misbehaviour), resulting in infinite
> > > > TLBError faults.
> > > >
> > > > Dan, I wonder why we just don't go back to the v2.4 behaviour.
> > >
> > > This is one reason why it is the way it is:
> > > http://ozlabs.org/pipermail/linuxppc-embedded/2005-January/016382.html
> > > The details are a little fuzzy ATM, but I think the reason for the
> > > current impl. was only that it was less intrusive to impl.
> >
> > Ah, I see. I wonder if the bug is processor specific: we don't have
> > such changes in our v2.4 tree and never experienced such a problem.
> >
> > It should be pretty easy to hit it, right? (Instruction pagefaults
> > should fail.)
> >
> > Grigori, Tom, can you enlighten us about the issue at the URL above.
> > How can it be triggered?
>
> So after looking at the code in 2.6.14 and current git, I think the
> above URL isn't relevant, unless there was a change I missed (which
> could totally be possible) that reverted the patch there and fixed
> that issue in a different manner. But since I didn't figure that out
> until I had finished researching it again:
I wasn't clear enough. What I meant was that the above patch made me
think, and the result was that I came up with a simpler fix, the
"two exception" fix that is in current kernels. See
http://linux.bkbits.net:8080/linux-2.6/diffs/arch/ppc/kernel/head_8xx.S@1.19?nav=index.html|src/.|src/arch|src/arch/ppc|src/arch/ppc/kernel|hist/arch/ppc/kernel/head_8xx.S

It appears this fix has some other issues :( How do the other ppc
arches do it? I am guessing that they don't double fault, but bail out
to do_page_fault from the TLB Miss handler, like 8xx used to do.

> Switching hats for a minute, this came from a bug a customer of
> MontaVista found, so I can't give out the testcase :(
>
> To repeat what Joakim said back then:
> "I think I have figured this out. The first TLB misses that happen at
> app startup are Data TLB misses. These will then hit the NULL L1 entry
> and end up in do_page_fault(), which will populate the L1 entry. But
> when you have a very large app that spans more than one L1 entry
> (16 MB I think), it may happen that you get an I-TLB Miss first on one
> of the L1 entries, which will make the I-TLB handler bail out to
> do_page_fault() and the app crashes (SEGV)."

This still stands, I think.

> Looking at the patch again, what I don't see is why I talk about
> fudging I-TLB Miss at 0x400 when it's I-TLB Error we fudge at being
> there, but then get hung up that there can be a slight diff between
> the two ("This is because we check bit 4 of SRR1 in both cases, but in
> the case of an I-TLB Miss, this bit is always set, and it only
> indicates a protection fault on an I-TLB Error.") so instead of 0x1300
> jumping to the handler at 0x400, we treat it like a regular exception
> so we know where we came from, and perhaps missed fixing a case
> somewhere?

Didn't look into this part of your patch, sorry.

 Jocke
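
PS: to make the "double fault" vs. "bail out" difference concrete,
here is a rough C model of the two schemes. The real code is the
assembler in head_8xx.S; everything below (lookup_pte, load_tlb_entry,
the one-entry page table) is invented purely for illustration, and the
"invalidate afterwards" part is only a guess at where the missing
tlbie would have to go.

#include <stdio.h>

/* Toy model only -- not kernel code. */
typedef unsigned long pte_t;

static pte_t page_table[1];                 /* pretend PTE for one page  */

static pte_t lookup_pte(unsigned long ea)   { (void)ea; return page_table[0]; }
static void  load_tlb_entry(pte_t pte)      { printf("  TLB <- %#lx\n", pte); }
static void  do_page_fault(unsigned long ea)
{
        printf("  do_page_fault(%#lx): set up pte\n", ea);
        page_table[0] = 0x1;                /* mark the page present     */
}

/* v2.4-style 8xx (and, I guess, the other ppc arches): the TLB Miss
 * handler notices the missing pte and bails out to do_page_fault()
 * itself -- one fault per page fault. */
static void tlb_miss_bail_out(unsigned long ea)
{
        pte_t pte = lookup_pte(ea);

        if (!pte) {
                do_page_fault(ea);
                return;
        }
        load_tlb_entry(pte);
}

/* Current "two exception" scheme: the TLB Miss handler loads whatever
 * it finds -- possibly a zeroed pte -- into the TLB.  The retried
 * access then takes a TLB Error, and that handler calls
 * do_page_fault(). */
static void tlb_miss_two_exception(unsigned long ea)
{
        load_tlb_entry(lookup_pte(ea));     /* may load a zeroed pte     */
}

static void tlb_error(unsigned long ea)
{
        do_page_fault(ea);
        /* if the zeroed entry loaded by the miss handler is not
         * invalidated here (something like a tlbie on ea), the TLB
         * Error just repeats -- the infinite fault loop Marcelo
         * describes above */
}

int main(void)
{
        printf("bail-out scheme:\n");
        tlb_miss_bail_out(0x1000);

        page_table[0] = 0;                  /* reset for the second run  */

        printf("two-exception scheme:\n");
        tlb_miss_two_exception(0x1000);     /* first fault: TLB Miss     */
        tlb_error(0x1000);                  /* second fault: TLB Error   */
        return 0;
}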