I finally got around to trying this out (patch attached), and it seems to fix x86. This change breaks ARM_FS, though. It faults when it tries to execute code at the fault vector, apparently because the page table entry is marked no-execute (I think). That makes the timing CPU spin: it keeps taking the fault, invoking it, and attempting the fetch again. The call stack never reaches a point where it has to wait for an event, so it never unwinds; it just keeps recursing until the stack gets too big and M5 segfaults. With the atomic CPU it seems to just get lost, and I'm not entirely sure what's going on there, although I suspect the atomic CPU is structured differently and doesn't recurse infinitely.
I wanted to ask the ARM folks if they know what's going on here. Is something about the page table walk supposed to be uncached but isn't? This seems to work without the walker cache added in, so I suspect the walker is picking up stale data or something.

Gabe

Gabe Black wrote:
> Of these, I think the walker cache sounds better for two reasons. First, it avoids the L1 pollution Ali was talking about, and second, a new bus would add mostly inert stuff on the way to memory and would involve looking up which port to use even though it'd always be the same one. I'll give that a try.
>
> Gabe
>
> Steve Reinhardt wrote:
>> I think the two easy (python-only) solutions are sharing the existing L1 via a bus and tacking on a small L1 to the walker. Which one is more realistic would depend on what you're trying to model.
>>
>> Steve
>>
>> On Tue, Nov 23, 2010 at 8:23 AM, Ali Saidi <sa...@umich.edu> wrote:
>>
>>     So what is the relatively good way to make this work in the short term? A bus? What about the slightly better version? I suppose a small cache might be ok and probably somewhat realistic.
>>
>>     Thanks,
>>     Ali
>>
>>     On Tue, 23 Nov 2010 08:15:01 -0800, Steve Reinhardt <ste...@gmail.com> wrote:
>>
>>> And even though I do think it could be made to work, I'm not sure it would be easy or a good idea. There are a lot of corner cases to worry about, especially for writes, since you'd have to actually buffer the write data somewhere as opposed to just remembering that so-and-so has requested an exclusive copy.
>>>
>>> Actually, as I think about it, that might be the case that's breaking now... if the L1 has an exclusive copy and then it snoops a write (and not a read-exclusive), I'm guessing it will just invalidate its copy, losing the modifications. I wouldn't be terribly surprised if reads are working OK (the L1 should snoop those and respond if it's the owner), and of course it's all OK if the L1 doesn't have a copy of the block.
>>>
>>> So maybe there is a relatively easy way to make this work, but figuring out whether that's true and then testing it is still a non-trivial amount of effort.
>>>
>>> Steve
>>>
>>> On Tue, Nov 23, 2010 at 7:57 AM, Steve Reinhardt <ste...@gmail.com> wrote:
>>>
>>>     No, when the L2 receives a request it assumes the L1s above it have already been snooped, which is true since the request came in on the bus that the L1s snoop. The issue is that caches don't necessarily behave correctly when non-cache-block requests come in through their mem-side (snoop) port and not through their cpu-side (request) port. I'm guessing this could be made to work; I'd just be very surprised if it does right now, since the caches weren't designed to deal with this case and aren't tested this way.
>>>
>>>     Steve
>>>
>>>     On Tue, Nov 23, 2010 at 7:50 AM, Ali Saidi <sa...@umich.edu> wrote:
>>>
>>>         Does it? Shouldn't the l2 receive the request, ask for the block and end up snooping the l1s?
>>>
>>>         Ali
>>>
>>>         On Tue, 23 Nov 2010 07:30:00 -0800, Steve Reinhardt <ste...@gmail.com> wrote:
>>>
>>>             The point is that connecting between the L1 and L2 induces the same problems wrt the L1 that connecting directly to memory induces wrt the whole cache hierarchy. You're just statistically more likely to get away with it in the former case because the L1 is smaller.
>>>
>>>             Steve
>>>
>>>             On Tue, Nov 23, 2010 at 7:16 AM, Ali Saidi <sa...@umich.edu> wrote:
>>>
>>>                 Where are you connecting the table walker? If it's between the l1 and l2, my guess is that it will work. If it is to the memory bus, yes, memory is just responding without the help of a cache and this could be the reason.
>>>
>>>                 Ali
>>>
>>>                 On Tue, 23 Nov 2010 06:29:20 -0500, Gabe Black <gbl...@eecs.umich.edu> wrote:
>>>
>>>                     I think I may have just now. I've fixed a few issues, and am now getting to the point where something that should be in the page tables is causing a page fault. I found where the table walker is walking the tables for this particular access, and the last level entry is all 0s. There could be a number of reasons it's all 0s, but since the main difference other than timing between this and a working configuration is the presence of caches, and we've identified a potential issue there, I'm inclined to suspect the actual page table entry is still in the L1 and hasn't been evicted out to memory yet.
>>>
>>>                     To fix this, is the best solution to add a bus below the CPU for all the connections that need to go to the L1? I'm assuming they'd all go into the dcache since they're more data-ey and that keeps the icache read only (ignoring SMC issues), and the dcache is probably servicing lower bandwidth normally. It also seems a little strange that this type of configuration is going on in the BaseCPU.py SimObject python file and not a configuration file, but I could be convinced there's a reason. Even if this isn't really a "fix" or the "right thing" to do, I'd still like to try it temporarily at least to see if it corrects the problem I'm seeing.
>>>
>>>                     Gabe
>>>
>>>                     Ali Saidi wrote:
>>>
>>>                         I haven't seen any strange behavior yet. That isn't to say it's not going to cause an issue in the future, but we've taken many a tlb miss and it hasn't fallen over yet.
>>>
>>>                         Ali
>>>
>>>                         On Mon, 22 Nov 2010 13:08:13 -0800, Steve Reinhardt <ste...@gmail.com> wrote:
>>>
>>>                             Yea, I just got around to reading this thread and that was the point I was going to make... the L1 cache effectively serves as a translator between the CPU's word-size read & write requests and the coherent block-level requests that get snooped. If you attach a CPU-like device (such as the table walker) directly to an L2, the CPU-like accesses that go to the L2 will get sent to the L1s, but I'm not sure they'll be handled correctly. Not that they fundamentally couldn't; this just isn't a configuration we test, so it's likely that there are problems... for example, the L1 may try to hand ownership to the requester, but the requester won't recognize that and things will break.
>>>
>>>                             Steve
>>>
>>>                             On Mon, Nov 22, 2010 at 12:00 PM, Gabe Black <gbl...@eecs.umich.edu> wrote:
>>>
>>>                                 What happens if an entry is in the L1 but not the L2?
>>>
>>>                                 Gabe
>>>
>>>                                 Ali Saidi wrote:
>>>                                 > Between the l1 and l2 caches seems like a good place to me. The caches can cache page table entries; otherwise a tlb miss would be even more expensive than it is. The l1 isn't normally used for such things since it would get polluted (look at why sparc has a load-128-bits-from-l2, do-not-allocate-into-l1 instruction).
>>>                                 >
>>>                                 > Ali
>>>                                 >
>>>                                 > On Nov 22, 2010, at 4:27 AM, Gabe Black wrote:
>>>                                 >
>>>                                 >> For anybody waiting for an x86 FS regression (yes, I know, you can all hardly wait, but don't let this spoil your Thanksgiving) I'm getting closer to having it working, but I've discovered some issues with the mechanisms behind the --caches flag with fs.py and x86. I'm surprised I never thought to try it before. It also brings up some questions about where the table walkers should be hooked up in x86 and ARM. Currently it's after the L1, if any, but before the L2, if any, which seems wrong to me. Also, caches don't seem to propagate requests upwards to the CPUs, which may or may not be an issue. I'm still looking into that.
>>>                                 >>
>>>                                 >> Gabe
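For reference, the bus-sharing alternative Steve mentioned would look roughly like the sketch below. It's untested, and the names walker_bus, dcache_port, and cpu_side are my assumptions based on the idioms in the existing configs; the attached patch takes the walker-cache route instead.

from m5.objects import *

# Untested sketch: one small bus below the CPU feeds the CPU's data
# accesses and both table walkers into the shared dcache, so walker
# requests see whatever page table entries are still sitting in the L1.
def shareDcacheWithWalkers(cpu, dc):
    cpu.dcache = dc
    cpu.walker_bus = Bus()                      # bus below the CPU
    cpu.dcache.cpu_side = cpu.walker_bus.port   # dcache services the bus
    cpu.dcache_port = cpu.walker_bus.port       # CPU data port joins it
    cpu.itb.walker.port = cpu.walker_bus.port   # instruction-side walker
    cpu.dtb.walker.port = cpu.walker_bus.port   # data-side walker

Whether the dcache handles the walkers' word-sized requests correctly on its cpu_side is exactly the question raised above, which is part of why the walker-cache version seemed safer to try first.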
X86, ARM: Add L1 caches for the TLB walkers.

Small L1 caches are connected to the TLB walkers when caches are used. This
allows them to participate in the coherence protocol properly.

diff --git a/configs/common/CacheConfig.py b/configs/common/CacheConfig.py
--- a/configs/common/CacheConfig.py
+++ b/configs/common/CacheConfig.py
@@ -43,8 +43,14 @@
 
 for i in xrange(options.num_cpus):
     if options.caches:
-        system.cpu[i].addPrivateSplitL1Caches(L1Cache(size = '32kB'),
-                                              L1Cache(size = '64kB'))
+        if buildEnv['TARGET_ISA'] in ['x86', 'arm']:
+            system.cpu[i].addPrivateSplitL1Caches(L1Cache(size = '32kB'),
+                                                  L1Cache(size = '64kB'),
+                                                  PageTableWalkerCache(),
+                                                  PageTableWalkerCache())
+        else:
+            system.cpu[i].addPrivateSplitL1Caches(L1Cache(size = '32kB'),
+                                                  L1Cache(size = '64kB'))
     if options.l2cache:
         system.cpu[i].connectMemPorts(system.tol2bus)
     else:
diff --git a/configs/common/Caches.py b/configs/common/Caches.py
--- a/configs/common/Caches.py
+++ b/configs/common/Caches.py
@@ -42,6 +42,14 @@
     mshrs = 20
     tgts_per_mshr = 12
 
+class PageTableWalkerCache(BaseCache):
+    assoc = 2
+    block_size = 64
+    latency = '1ns'
+    mshrs = 10
+    size = '1kB'
+    tgts_per_mshr = 12
+
 class IOCache(BaseCache):
     assoc = 8
     block_size = 64
diff --git a/src/cpu/BaseCPU.py b/src/cpu/BaseCPU.py
--- a/src/cpu/BaseCPU.py
+++ b/src/cpu/BaseCPU.py
@@ -166,7 +166,7 @@
             if p != 'physmem_port':
                 exec('self.%s = bus.port' % p)
 
-    def addPrivateSplitL1Caches(self, ic, dc):
+    def addPrivateSplitL1Caches(self, ic, dc, iwc = None, dwc = None):
         assert(len(self._mem_ports) < 8)
         self.icache = ic
         self.dcache = dc
@@ -175,12 +175,17 @@
         self._mem_ports = ['icache.mem_side', 'dcache.mem_side']
         if buildEnv['FULL_SYSTEM']:
             if buildEnv['TARGET_ISA'] in ['x86', 'arm']:
-                self._mem_ports += ["itb.walker.port", "dtb.walker.port"]
+                self.itb_walker_cache = iwc
+                self.dtb_walker_cache = dwc
+                self.itb.walker.port = iwc.cpu_side
+                self.dtb.walker.port = dwc.cpu_side
+                self._mem_ports += ["itb_walker_cache.mem_side", \
+                                    "dtb_walker_cache.mem_side"]
             if buildEnv['TARGET_ISA'] == 'x86':
                 self._mem_ports += ["interrupts.pio", "interrupts.int_port"]
 
-    def addTwoLevelCacheHierarchy(self, ic, dc, l2c):
-        self.addPrivateSplitL1Caches(ic, dc)
+    def addTwoLevelCacheHierarchy(self, ic, dc, l2c, iwc = None, dwc = None):
+        self.addPrivateSplitL1Caches(ic, dc, iwc, dwc)
         self.toL2Bus = Bus()
         self.connectMemPorts(self.toL2Bus)
         self.l2cache = l2c
diff --git a/src/cpu/o3/O3CPU.py b/src/cpu/o3/O3CPU.py
--- a/src/cpu/o3/O3CPU.py
+++ b/src/cpu/o3/O3CPU.py
@@ -141,7 +141,7 @@
     smtROBThreshold = Param.Int(100, "SMT ROB Threshold Sharing Parameter")
     smtCommitPolicy = Param.String('RoundRobin', "SMT Commit Policy")
 
-    def addPrivateSplitL1Caches(self, ic, dc):
-        BaseCPU.addPrivateSplitL1Caches(self, ic, dc)
+    def addPrivateSplitL1Caches(self, ic, dc, iwc = None, dwc = None):
+        BaseCPU.addPrivateSplitL1Caches(self, ic, dc, iwc, dwc)
         self.icache.tgts_per_mshr = 20
         self.dcache.tgts_per_mshr = 20
diff --git a/tests/configs/realview-simple-atomic.py b/tests/configs/realview-simple-atomic.py
--- a/tests/configs/realview-simple-atomic.py
+++ b/tests/configs/realview-simple-atomic.py
@@ -53,6 +53,17 @@
     write_buffers = 8
 
 # ---------------------
+# Page table walker cache
+# ---------------------
+class PageTableWalkerCache(BaseCache):
+    assoc = 2
+    block_size = 64
+    latency = '1ns'
+    mshrs = 10
+    size = '1kB'
+    tgts_per_mshr = 12
+
+# ---------------------
 # I/O Cache
 # ---------------------
 class IOCache(BaseCache):
@@ -86,7 +97,9 @@
 
 #connect up the cpu and l1s
 cpu.addPrivateSplitL1Caches(L1(size = '32kB', assoc = 1),
-                            L1(size = '32kB', assoc = 4))
+                            L1(size = '32kB', assoc = 4),
+                            PageTableWalkerCache(),
+                            PageTableWalkerCache())
 # connect cpu level-1 caches to shared level-2 cache
 cpu.connectMemPorts(system.toL2Bus)
 cpu.clock = '2GHz'
diff --git a/tests/configs/realview-simple-timing.py b/tests/configs/realview-simple-timing.py
--- a/tests/configs/realview-simple-timing.py
+++ b/tests/configs/realview-simple-timing.py
@@ -54,6 +54,17 @@
     write_buffers = 8
 
 # ---------------------
+# Page table walker cache
+# ---------------------
+class PageTableWalkerCache(BaseCache):
+    assoc = 2
+    block_size = 64
+    latency = '1ns'
+    mshrs = 10
+    size = '1kB'
+    tgts_per_mshr = 12
+
+# ---------------------
 # I/O Cache
 # ---------------------
 class IOCache(BaseCache):
@@ -88,7 +99,9 @@
 
 #connect up the cpu and l1s
 cpu.addPrivateSplitL1Caches(L1(size = '32kB', assoc = 1),
-                            L1(size = '32kB', assoc = 4))
+                            L1(size = '32kB', assoc = 4),
+                            PageTableWalkerCache(),
+                            PageTableWalkerCache())
 # connect cpu level-1 caches to shared level-2 cache
 cpu.connectMemPorts(system.toL2Bus)
 cpu.clock = '2GHz'
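As a usage note (not part of the patch): a two-level config could hand the walker caches through the new optional arguments like this. This is only a sketch of a config fragment; cpu is an already-constructed full-system x86 or ARM CPU, and L2Cache is a hypothetical stand-in for whatever L2 class the script defines.

from Caches import *   # L1Cache and the new PageTableWalkerCache

# Sketch: split L1s, a per-CPU L2, and the two new walker caches.
# L2Cache is a placeholder name, not something this patch defines.
cpu.addTwoLevelCacheHierarchy(L1Cache(size = '32kB'),
                              L1Cache(size = '64kB'),
                              L2Cache(size = '2MB', assoc = 8),
                              PageTableWalkerCache(),
                              PageTableWalkerCache())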
_______________________________________________
m5-dev mailing list
m5-dev@m5sim.org
http://m5sim.org/mailman/listinfo/m5-dev