On Sat, 21 May 2011 08:15:36 +1000 Benjamin Herrenschmidt <b...@kernel.crashing.org> wrote:
> On Fri, 2011-05-20 at 15:57 -0500, Scott Wood wrote:
> >
> > I see a 2% cost going from virtual pmd to full 4-level walk in the
> > benchmark mentioned above (some type of sort), and just under 3% in
> > page-stride lat_mem_rd from lmbench.
> >
> > OTOH, the virtual pmd approach still leaves the possibility of taking
> > a bunch of virtual page table misses if non-localized accesses happen
> > over a very large chunk of address space (tens of GiB), and we'd have
> > one fewer type of TLB miss to worry about complexity-wise with a
> > straight table walk.
> >
> > Let me know what you'd prefer.
>
> I'm tempted to kill the virtual linear feature altogether... it didn't
> buy us that much.  Have you looked at whether you can snatch back some
> of those cycles with hand tuning of the level walker?

Those numbers are already after trying a bit of that (I pulled the pgd
load up before normal_tlb_miss, and did some other reordering).  I'm not
sure how much more can be squeezed out with such techniques, at least on
e5500.

Hmm, in the normal miss case we know we're in the first EXTLB level,
right?  So we could cut out a load/mfspr by subtracting EXTLB from r12
to get the PACA (that load's latency is pretty well buried, but maybe we
could replace it with loading pgd, replacing it later if it turns out to
be a kernel region).  Maybe move pgd into the first EXTLB level, so it's
in the same cache line as the state save data.  The PACA cache line
containing pgd is probably pretty hot in normal kernel code, but not so
much in a long stretch of userspace plus TLB misses (other than for pgd
itself).

> Would it work/help to have a simple cache of the last pmd & address and
> compare just that?

Maybe.  It would still slow down the case where you miss that cache --
not by as much as a virtual page table miss (and it wouldn't compete for
TLB entries with actual user pages), but it would happen more often,
since you'd only be able to cache one pmd.

> Maybe in an SPRG or a known cache-hot location like the PACA, in a line
> that we already load anyways?

A cache access is faster than an SPRG access on our chips (plus we don't
have many SPRGs to spare, especially if we want to avoid swapping
SPRG4-7 on guest entry/exit in KVM), so I'd favor putting it in the
PACA.

I'll try this stuff out and see what helps.

-Scott
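For context on the two lookup shapes being compared above, a minimal,
self-contained C sketch.  Everything in it is an illustrative
assumption -- the 4 KiB page size, the 9-bit index widths, the made-up
VPTE_BASE, and the function names -- and the real handlers are
hand-written assembly, so this only shows the shape of the work: one
computed load for the virtual linear page table (which can itself
TLB-miss) versus a chain of dependent loads for the full walk.

#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT	12			/* assumed: 4 KiB pages */
#define VPTE_BASE	0xc000100000000000UL	/* made-up linear pte region */

/*
 * Virtual linear page table: the pte for an effective address is a
 * single load from a computed virtual address.  If that address itself
 * misses in the TLB we take a nested virtual page table miss -- the
 * case that hurts for sparse accesses spread over tens of GiB.
 */
static inline uintptr_t vpte_ea(uintptr_t ea)
{
	return VPTE_BASE + (ea >> PAGE_SHIFT) * sizeof(uint64_t);
}

/*
 * Full 4-level walk: three dependent loads (pgd -> pud -> pmd) before
 * the final pte load, with no nested misses.  That dependency chain is
 * where the measured 2-3% goes.  Flag bits and physical-address
 * translation are omitted; entries are treated as plain pointers.
 */
static uint64_t *walk_pmd_entry(uint64_t *pgd_base, uintptr_t ea)
{
	uint64_t *pud = (uint64_t *)(uintptr_t)pgd_base[(ea >> 39) & 0x1ff];
	if (!pud)
		return NULL;
	uint64_t *pmd = (uint64_t *)(uintptr_t)pud[(ea >> 30) & 0x1ff];
	if (!pmd)
		return NULL;
	/* What the pmd entry points to: the pte page for this 2 MiB region. */
	return (uint64_t *)(uintptr_t)pmd[(ea >> 21) & 0x1ff];
}

static uint64_t *walk_pte(uint64_t *pgd_base, uintptr_t ea)
{
	uint64_t *pte_page = walk_pmd_entry(pgd_base, ea);
	return pte_page ? &pte_page[(ea >> PAGE_SHIFT) & 0x1ff] : NULL;
}

And a similarly hypothetical sketch of the one-entry pmd cache idea
being discussed, continuing the snippet above (it reuses
walk_pmd_entry()); the struct merely stands in for the PACA, the field
names are invented, and the 2 MiB region size follows from the assumed
geometry:

#define PMD_SHIFT	21			/* one pmd entry covers 2 MiB here */
#define PMD_REGION(ea)	((ea) & ~(((uintptr_t)1 << PMD_SHIFT) - 1))

struct paca_like {
	uintptr_t  pmd_cache_ea;	/* pmd-aligned EA the cached entry covers */
	uint64_t  *pmd_cache;		/* pte page that pmd entry points to */
	/* ideally placed in a line the miss handler already touches */
};

static uint64_t *find_pte_cached(struct paca_like *paca, uint64_t *pgd_base,
				 uintptr_t ea)
{
	/* Fast path: same pmd region as last time -- skip all three
	 * upper-level loads and go straight to the pte. */
	if (paca->pmd_cache && paca->pmd_cache_ea == PMD_REGION(ea))
		return &paca->pmd_cache[(ea >> PAGE_SHIFT) & 0x1ff];

	/* Slow path: full walk, then refill the one-entry cache.  This
	 * is the "miss that cache" case: cheaper than a virtual page
	 * table miss and not competing for TLB entries, but taken every
	 * time the access leaves the single cached 2 MiB region. */
	uint64_t *pte_page = walk_pmd_entry(pgd_base, ea);
	if (!pte_page)
		return NULL;
	paca->pmd_cache_ea = PMD_REGION(ea);
	paca->pmd_cache    = pte_page;
	return &pte_page[(ea >> PAGE_SHIFT) & 0x1ff];
}

As the reply notes, with only one cached pmd any access pattern that
strides across many such regions falls back to the full walk plus an
extra compare, so whether the fast path pays off depends on locality.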