Re: [m5-dev] X86 FS regression
I finally got around to trying this out (patch attached), and it seemed to fix x86. This change seems to break ARM_FS, though. It faults when it tries to execute code at the fault vector because the page table entry is supposedly marked no-execute (I think). That makes the timing CPU spin around and around because it keeps getting a fault, invoking it, and attempting to fetch again. The call stack never hits a point where it has to wait for an event, so it never collapses back down; it recurses until the stack is too big and M5 segfaults. It seems to just get lost with the atomic CPU, and I'm not entirely sure what's going on there, although I suspect the atomic CPU is just structured differently and doesn't recurse infinitely. I wanted to ask the ARM folks if they knew what was going on here. Is something about the page table walk supposed to be uncached but isn't? This seems to work without that cache added in, so I suspect the walker is picking up stale data or something. Gabe

Gabe Black wrote: Of these, I think the walker cache sounds better for two reasons. First, it avoids the L1 pollution Ali was talking about, and second, a new bus would add mostly inert stuff on the way to memory and would involve looking up which port to use even though it'd always be the same one. I'll give that a try. Gabe

Steve Reinhardt wrote: I think the two easy (python-only) solutions are sharing the existing L1 via a bus and tacking a small L1 onto the walker. Which one is more realistic would depend on what you're trying to model. Steve

On Tue, Nov 23, 2010 at 8:23 AM, Ali Saidi <sa...@umich.edu> wrote: So what is the relatively good way to make this work in the short term? A bus? What about the slightly better version? I suppose a small cache might be ok and probably somewhat realistic. Thanks, Ali

On Tue, 23 Nov 2010 08:15:01 -0800, Steve Reinhardt <ste...@gmail.com> wrote: And even though I do think it could be made to work, I'm not sure it would be easy or a good idea. There are a lot of corner cases to worry about, especially for writes, since you'd have to actually buffer the write data somewhere, as opposed to just remembering that so-and-so has requested an exclusive copy. Actually, as I think about it, that might be the case that's breaking now... if the L1 has an exclusive copy and then it snoops a write (and not a read-exclusive), I'm guessing it will just invalidate its copy, losing the modifications. I wouldn't be terribly surprised if reads are working OK (the L1 should snoop those and respond if it's the owner), and of course it's all OK if the L1 doesn't have a copy of the block. So maybe there is a relatively easy way to make this work, but figuring out whether that's true and then testing it is still a non-trivial amount of effort. Steve

On Tue, Nov 23, 2010 at 7:57 AM, Steve Reinhardt <ste...@gmail.com> wrote: No, when the L2 receives a request, it assumes the L1s above it have already been snooped, which is true since the request came in on the bus that the L1s snoop. The issue is that caches don't necessarily behave correctly when non-cache-block requests come in through their mem-side (snoop) port rather than their cpu-side (request) port. I'm guessing this could be made to work; I'd just be very surprised if it does right now, since the caches weren't designed to deal with this case and aren't tested this way. Steve

On Tue, Nov 23, 2010 at 7:50 AM, Ali Saidi <sa...@umich.edu> wrote: Does it? Shouldn't the l2 receive the request, ask for the block, and end up snooping the l1s? Ali

On Tue, 23 Nov 2010 07:30:00 -0800, Steve Reinhardt <ste...@gmail.com> wrote: The point is that connecting between the L1 and L2 induces the same problems wrt the L1 that connecting directly to memory induces wrt the whole cache hierarchy. You're just statistically more likely to get away with it in the former case because the L1 is smaller. Steve

On Tue, Nov 23, 2010 at 7:16 AM, Ali Saidi <sa...@umich.edu> wrote: Where are you connecting the table walker? If it's between the l1 and l2, my guess is that it will work. If it is connected to the memory bus, yes, memory is just responding without the help of a cache, and this could be the reason.
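For illustration, the walker-cache option might look something like the sketch below in a config script. This assumes 2010-era m5 config syntax in the style of configs/common/Caches.py; the WalkerCache parameter values and the walker port name are assumptions, not the actual BaseCPU.py interface.

    # Sketch only: give the table walker its own small cache in front of
    # the bus above the L2. Names and parameter values are illustrative
    # assumptions.
    from m5.objects import BaseCache

    class WalkerCache(BaseCache):
        size = '1kB'        # a handful of page-table lines is plenty
        assoc = 2
        block_size = 64
        latency = '1ns'
        mshrs = 4
        tgts_per_mshr = 8

    walker_cache = WalkerCache()
    cpu.dtb.walker.port = walker_cache.cpu_side   # walker -> private cache
    walker_cache.mem_side = system.tol2bus.port   # cache -> L2 bus

This keeps walker traffic out of the L1 (avoiding the pollution Ali mentions) without the extra port lookup a dedicated bus would add.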
Re: [m5-dev] X86 FS regression
I've got a patch that gets closer to supporting caches between the TLB and L2 cache, but it doesn't work. Since we don't have a way to invalidate addresses out of the cache if you switch a memory region to uncacheable, the address remains in the cache and causes all sorts of havoc. If you want to make some changes, please make them x86-only for now. We'll need to implement some form of cache cleaning or invalidating before it can be supported on ARM. Thanks, Ali

On Dec 13, 2010, at 4:56 AM, Gabe Black wrote: I finally got around to trying this out (patch attached), and it seemed to fix x86. This change seems to break ARM_FS, though. It faults when it tries to execute code at the fault vector because the page table entry is supposedly marked no-execute (I think). That makes the timing CPU spin around and around because it keeps getting a fault, invoking it, and attempting to fetch again. The call stack never hits a point where it has to wait for an event, so it never collapses back down; it recurses until the stack is too big and M5 segfaults. It seems to just get lost with the atomic CPU, and I'm not entirely sure what's going on there, although I suspect the atomic CPU is just structured differently and doesn't recurse infinitely. I wanted to ask the ARM folks if they knew what was going on here. Is something about the page table walk supposed to be uncached but isn't? This seems to work without that cache added in, so I suspect the walker is picking up stale data or something. Gabe

Gabe Black wrote: Of these, I think the walker cache sounds better for two reasons. First, it avoids the L1 pollution Ali was talking about, and second, a new bus would add mostly inert stuff on the way to memory and would involve looking up which port to use even though it'd always be the same one. I'll give that a try. Gabe

Steve Reinhardt wrote: I think the two easy (python-only) solutions are sharing the existing L1 via a bus and tacking a small L1 onto the walker. Which one is more realistic would depend on what you're trying to model. Steve

On Tue, Nov 23, 2010 at 8:23 AM, Ali Saidi <sa...@umich.edu> wrote: So what is the relatively good way to make this work in the short term? A bus? What about the slightly better version? I suppose a small cache might be ok and probably somewhat realistic. Thanks, Ali

On Tue, 23 Nov 2010 08:15:01 -0800, Steve Reinhardt <ste...@gmail.com> wrote: And even though I do think it could be made to work, I'm not sure it would be easy or a good idea. There are a lot of corner cases to worry about, especially for writes, since you'd have to actually buffer the write data somewhere, as opposed to just remembering that so-and-so has requested an exclusive copy. Actually, as I think about it, that might be the case that's breaking now... if the L1 has an exclusive copy and then it snoops a write (and not a read-exclusive), I'm guessing it will just invalidate its copy, losing the modifications. I wouldn't be terribly surprised if reads are working OK (the L1 should snoop those and respond if it's the owner), and of course it's all OK if the L1 doesn't have a copy of the block. So maybe there is a relatively easy way to make this work, but figuring out whether that's true and then testing it is still a non-trivial amount of effort. Steve

On Tue, Nov 23, 2010 at 7:57 AM, Steve Reinhardt <ste...@gmail.com> wrote: No, when the L2 receives a request, it assumes the L1s above it have already been snooped, which is true since the request came in on the bus that the L1s snoop. The issue is that caches don't necessarily behave correctly when non-cache-block requests come in through their mem-side (snoop) port rather than their cpu-side (request) port. I'm guessing this could be made to work; I'd just be very surprised if it does right now, since the caches weren't designed to deal with this case and aren't tested this way. Steve

On Tue, Nov 23, 2010 at 7:50 AM, Ali Saidi <sa...@umich.edu> wrote: Does it? Shouldn't the l2 receive the request, ask for the block, and end up snooping the l1s? Ali

On Tue, 23 Nov 2010 07:30:00 -0800, Steve Reinhardt <ste...@gmail.com> wrote: The point is that connecting between the L1 and L2 induces the same problems wrt the L1 that connecting directly to memory induces wrt the whole cache hierarchy. You're just statistically more likely to get away with it in the former case because the L1 is smaller.
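As a rough sketch of the x86-only restriction Ali asks for, using the buildEnv check the config scripts already rely on (the helper name here is hypothetical):

    # Only route walker traffic through the extra cache on x86 for now;
    # ARM needs cache clean/invalidate support first, since switching a
    # region to uncacheable leaves stale lines behind.
    from m5.defines import buildEnv

    if buildEnv['TARGET_ISA'] == 'x86':
        connect_walker_caches(test_sys)   # hypothetical helper
    # on other ISAs, leave the walkers hooked up as before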
Re: [m5-dev] X86 FS regression
On Wed, Dec 1, 2010 at 3:07 PM, Ali Saidi <sa...@umich.edu> wrote: Continuing the e-mail thread that never dies... It appears as though the dcache somehow does the correct thing when a read request comes into the l2 bus. Note that the dcache is snooping the request.

Listening for system connection on port 3456
481100500: system.cpu.dtb.walker: Begining table walk for address 0xc020, TTBCR: 0, bits:0
481100500: system.cpu.dtb.walker: - Selecting TTBR0
481100500: system.cpu.dtb.walker: - Descriptor at address 0x7008
481100500: system.tol2bus: recvAtomic: packet src 4 dest -1 addr 0x7008 cmd ReadReq
481100500: system.cpu.dcache: snooped a ReadReq request for addr 7000, responding, new state is 13
481100500: system.l2: rcvd mem-inhibited ReadReq on 0x7008: not responding
481100500: system.cpu.dtb.walker: L1 descriptor for 0xc020 is 0x20040e

After some work I managed to get a cache to work in this case too. The table walker has to kick off a sendStatusChange(); otherwise the cache below it doesn't get added to the snooping list of the tol2bus.

I'm not too surprised that reads work, but what about writes (e.g., if the TLB walker sets an accessed bit)? Steve
Re: [m5-dev] X86 FS regression
Continuing the e-mail thread that never dies... It appears as though the dcache somehow does the correct thing when a read request comes into the l2 bus. Note that the dcache is snooping the request.

Listening for system connection on port 3456
481100500: system.cpu.dtb.walker: Begining table walk for address 0xc020, TTBCR: 0, bits:0
481100500: system.cpu.dtb.walker: - Selecting TTBR0
481100500: system.cpu.dtb.walker: - Descriptor at address 0x7008
481100500: system.tol2bus: recvAtomic: packet src 4 dest -1 addr 0x7008 cmd ReadReq
481100500: system.cpu.dcache: snooped a ReadReq request for addr 7000, responding, new state is 13
481100500: system.l2: rcvd mem-inhibited ReadReq on 0x7008: not responding
481100500: system.cpu.dtb.walker: L1 descriptor for 0xc020 is 0x20040e

After some work I managed to get a cache to work in this case too. The table walker has to kick off a sendStatusChange(); otherwise the cache below it doesn't get added to the snooping list of the tol2bus. Ali

On Tue, 23 Nov 2010 07:30:00 -0800, Steve Reinhardt wrote: The point is that connecting between the L1 and L2 induces the same problems wrt the L1 that connecting directly to memory induces wrt the whole cache hierarchy. You're just statistically more likely to get away with it in the former case because the L1 is smaller. Steve

On Tue, Nov 23, 2010 at 7:16 AM, Ali Saidi wrote: Where are you connecting the table walker? If it's between the l1 and l2, my guess is that it will work. If it is connected to the memory bus, yes, memory is just responding without the help of a cache, and this could be the reason. Ali

On Tue, 23 Nov 2010 06:29:20 -0500, Gabe Black wrote: I think I may have just now. I've fixed a few issues, and am now getting to the point where something that should be in the page tables is causing a page fault. I found where the table walker is walking the tables for this particular access, and the last-level entry is all 0s. There could be a number of reasons this is all 0s, but since the main difference, other than timing, between this and a working configuration is the presence of caches, and we've identified a potential issue there, I'm inclined to suspect the actual page table entry is still in the L1 and hasn't been evicted out to memory yet. To fix this, is the best solution to add a bus below the CPU for all the connections that need to go to the L1? I'm assuming they'd all go into the dcache, since they're more data-ey and that keeps the icache read-only (ignoring SMC issues), and the dcache is probably servicing lower bandwidth normally. It also seems a little strange that this type of configuration is going on in the BaseCPU.py SimObject python file and not a configuration file, but I could be convinced there's a reason. Even if this isn't really a fix or the right thing to do, I'd still like to try it temporarily, at least to see if it corrects the problem I'm seeing. Gabe

Ali Saidi wrote: I haven't seen any strange behavior yet. That isn't to say it's not going to cause an issue in the future, but we've taken many a tlb miss and it hasn't fallen over yet. Ali

On Mon, 22 Nov 2010 13:08:13 -0800, Steve Reinhardt wrote: Yeah, I just got around to reading this thread, and that was the point I was going to make... the L1 cache effectively serves as a translator between the CPU's word-size read/write requests and the coherent block-level requests that get snooped. If you attach a CPU-like device (such as the table walker) directly to an L2, the CPU-like accesses that go to the L2 will get sent to the L1s, but I'm not sure they'll be handled correctly. Not that they fundamentally couldn't; this just isn't a configuration we test, so it's likely that there are problems... for example, the L1 may try to hand ownership to the requester, but the requester won't recognize that and things will break. Steve

On Mon, Nov 22, 2010 at 12:00 PM, Gabe Black wrote: What happens if an entry is in the L1 but not the L2? Gabe

Ali Saidi wrote: Between the l1 and l2 caches seems like a good place to me. The caches can cache page table entries; otherwise a tlb miss would be even more expensive than it is. The l1 isn't normally used for such things, since it would get polluted (look at why sparc has a "load 128 bits from l2, do not allocate into l1" instruction). Ali

On Nov 22, 2010, at 4:27 AM, Gabe Black wrote: For anybody waiting for an x86 FS regression (yes, I know, you can all hardly wait, but don't let this spoil your Thanksgiving): I'm getting closer to having it working, but I've discovered some issues with the mechanisms behind the --caches flag with fs.py and x86. I'm surprised I never thought to try it before. It also brings up some questions about where the table walkers should be hooked up in x86 and ARM. Currently it's after the L1, if any, but before the L2, if any, which seems wrong to me. Also caches don't seem to propagate requests upwards to the CPUs, which may or may not be an issue. I'm still looking into that.
Re: [m5-dev] X86 FS regression
I see that the bridge and cache are in parallel like you're describing. The culprit seems to be this line: configs/example/fs.py: test_sys.bridge.filter_ranges_a=[AddrRange(0, Addr.max)] where the bridge is being told explicitly not to let anything through from the IO side to the memory side. That should be fairly straightforward to poke a hole in for the necessary ranges. The corresponding line for the other direction (below) brings up another question: what happens if the bridge doesn't disallow something from going across and something else wants to respond to an address? The bridge isn't set to ignore APIC messages implementing IPIs between CPUs, but those seem to be going between CPUs and not out into the IO system. Are we just getting lucky? This same thing would seem to apply to any other memory-side object that isn't in the address range 0-mem_size. configs/example/fs.py: test_sys.bridge.filter_ranges_b=[AddrRange(mem_size)] Gabe

Steve Reinhardt wrote: I believe the I/O cache is normally paired with a bridge that lets things flow in the other direction. It's really just designed to handle accesses to cacheable space from devices on the I/O bus without requiring each device to have a cache. It's possible we've never had a situation before where I/O devices issue accesses to uncacheable non-memory locations on the CPU side of the I/O cache, in which case I would not be terribly surprised if that didn't quite work. Steve

On Mon, Nov 22, 2010 at 11:59 AM, Gabe Black <gbl...@eecs.umich.edu> wrote: The cache claims to support all addresses on the CPU side (or so say the comments), but no addresses on the memory side. Messages going from the IO interrupt controller get to the IO bus but then don't know where to go, since the IO cache hides the fact that the CPU interrupt controller wants to receive messages on that address range. I also don't know if the cache can handle messages passing through originating from the memory side, but I didn't look into that. Gabe

Ali Saidi wrote: Something has to maintain i/o coherency, and that something looks a whole lot like a couple-line cache. Why is having a cache there an issue? They should pass right through the cache. Ali

On Nov 22, 2010, at 4:42 AM, Gabe Black wrote: Hmm. It looks like this IO cache is only added when there are caches in the system (a fix for some coherency something? I sort of remember that discussion), and that wouldn't propagate to the IO bus the fact that the CPU's local APIC wanted to receive interrupt messages passed over the memory system. I don't know the intricacies of why the IO cache was necessary, or what problems passing requests back up through the cache might cause, but this is a serious issue for x86 and any other ISA that wants to move to a message-based interrupt scheme. I suppose the interrupt objects could be connected all the way out onto the IO bus itself, bypassing that cache, but I'm not sure how realistic that is.

Gabe Black wrote: For anybody waiting for an x86 FS regression (yes, I know, you can all hardly wait, but don't let this spoil your Thanksgiving): I'm getting closer to having it working, but I've discovered some issues with the mechanisms behind the --caches flag with fs.py and x86. I'm surprised I never thought to try it before. It also brings up some questions about where the table walkers should be hooked up in x86 and ARM. Currently it's after the L1, if any, but before the L2, if any, which seems wrong to me. Also caches don't seem to propagate requests upwards to the CPUs, which may or may not be an issue. I'm still looking into that. Gabe
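Poking the hole Gabe describes might look like the following sketch; the interrupt-window bounds are hypothetical placeholders, not the real x86 APIC constants:

    # Today the bridge filters everything from the IO side:
    #   test_sys.bridge.filter_ranges_a = [AddrRange(0, Addr.max)]
    # Sketch: filter everything *except* the interrupt-message window so
    # APIC/IPI traffic can cross. interrupt_base and interrupt_end are
    # placeholder bounds.
    test_sys.bridge.filter_ranges_a = [
        AddrRange(0, interrupt_base - 1),
        AddrRange(interrupt_end + 1, Addr.max),
    ]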
Re: [m5-dev] X86 FS regression
I think I may have just now. I've fixed a few issues, and am now getting to the point where something that should be in the page tables is causing a page fault. I found where the table walker is walking the tables for this particular access, and the last-level entry is all 0s. There could be a number of reasons this is all 0s, but since the main difference, other than timing, between this and a working configuration is the presence of caches, and we've identified a potential issue there, I'm inclined to suspect the actual page table entry is still in the L1 and hasn't been evicted out to memory yet. To fix this, is the best solution to add a bus below the CPU for all the connections that need to go to the L1? I'm assuming they'd all go into the dcache, since they're more data-ey and that keeps the icache read-only (ignoring SMC issues), and the dcache is probably servicing lower bandwidth normally. It also seems a little strange that this type of configuration is going on in the BaseCPU.py SimObject python file and not a configuration file, but I could be convinced there's a reason. Even if this isn't really a fix or the right thing to do, I'd still like to try it temporarily, at least to see if it corrects the problem I'm seeing. Gabe

Ali Saidi wrote: I haven't seen any strange behavior yet. That isn't to say it's not going to cause an issue in the future, but we've taken many a tlb miss and it hasn't fallen over yet. Ali

On Mon, 22 Nov 2010 13:08:13 -0800, Steve Reinhardt <ste...@gmail.com> wrote: Yeah, I just got around to reading this thread, and that was the point I was going to make... the L1 cache effectively serves as a translator between the CPU's word-size read/write requests and the coherent block-level requests that get snooped. If you attach a CPU-like device (such as the table walker) directly to an L2, the CPU-like accesses that go to the L2 will get sent to the L1s, but I'm not sure they'll be handled correctly. Not that they fundamentally couldn't; this just isn't a configuration we test, so it's likely that there are problems... for example, the L1 may try to hand ownership to the requester, but the requester won't recognize that and things will break. Steve

On Mon, Nov 22, 2010 at 12:00 PM, Gabe Black <gbl...@eecs.umich.edu> wrote: What happens if an entry is in the L1 but not the L2? Gabe

Ali Saidi wrote: Between the l1 and l2 caches seems like a good place to me. The caches can cache page table entries; otherwise a tlb miss would be even more expensive than it is. The l1 isn't normally used for such things, since it would get polluted (look at why sparc has a "load 128 bits from l2, do not allocate into l1" instruction). Ali

On Nov 22, 2010, at 4:27 AM, Gabe Black wrote: For anybody waiting for an x86 FS regression (yes, I know, you can all hardly wait, but don't let this spoil your Thanksgiving): I'm getting closer to having it working, but I've discovered some issues with the mechanisms behind the --caches flag with fs.py and x86. I'm surprised I never thought to try it before. It also brings up some questions about where the table walkers should be hooked up in x86 and ARM. Currently it's after the L1, if any, but before the L2, if any, which seems wrong to me. Also caches don't seem to propagate requests upwards to the CPUs, which may or may not be an issue. I'm still looking into that.
Gabe
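The "bus below the CPU" experiment Gabe proposes might be wired up roughly as follows; the port and attribute names (dcache_port, the itb/dtb walker ports, cpu.dcache) are assumptions about the configuration of the time, not confirmed interfaces:

    # Sketch: share the existing L1 dcache between the CPU's data port
    # and the TLB walkers via a small bus, so walker fetches see the same
    # coherent L1 as ordinary loads and stores.
    dcache_bus = Bus()
    cpu.dcache_port = dcache_bus.port       # CPU data accesses
    cpu.itb.walker.port = dcache_bus.port   # instruction-side walker
    cpu.dtb.walker.port = dcache_bus.port   # data-side walker
    dcache_bus.port = cpu.dcache.cpu_side   # bus feeds the existing L1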
Re: [m5-dev] X86 FS regression
Where are you connecting the table walker? If it's between the l1 and l2, my guess is that it will work. If it is connected to the memory bus, yes, memory is just responding without the help of a cache, and this could be the reason. Ali

On Tue, 23 Nov 2010 06:29:20 -0500, Gabe Black <gbl...@eecs.umich.edu> wrote: I think I may have just now. I've fixed a few issues, and am now getting to the point where something that should be in the page tables is causing a page fault. I found where the table walker is walking the tables for this particular access, and the last-level entry is all 0s. There could be a number of reasons this is all 0s, but since the main difference, other than timing, between this and a working configuration is the presence of caches, and we've identified a potential issue there, I'm inclined to suspect the actual page table entry is still in the L1 and hasn't been evicted out to memory yet. To fix this, is the best solution to add a bus below the CPU for all the connections that need to go to the L1? I'm assuming they'd all go into the dcache, since they're more data-ey and that keeps the icache read-only (ignoring SMC issues), and the dcache is probably servicing lower bandwidth normally. It also seems a little strange that this type of configuration is going on in the BaseCPU.py SimObject python file and not a configuration file, but I could be convinced there's a reason. Even if this isn't really a fix or the right thing to do, I'd still like to try it temporarily, at least to see if it corrects the problem I'm seeing. Gabe

Ali Saidi wrote: I haven't seen any strange behavior yet. That isn't to say it's not going to cause an issue in the future, but we've taken many a tlb miss and it hasn't fallen over yet. Ali

On Mon, 22 Nov 2010 13:08:13 -0800, Steve Reinhardt <ste...@gmail.com> wrote: Yeah, I just got around to reading this thread, and that was the point I was going to make... the L1 cache effectively serves as a translator between the CPU's word-size read/write requests and the coherent block-level requests that get snooped. If you attach a CPU-like device (such as the table walker) directly to an L2, the CPU-like accesses that go to the L2 will get sent to the L1s, but I'm not sure they'll be handled correctly. Not that they fundamentally couldn't; this just isn't a configuration we test, so it's likely that there are problems... for example, the L1 may try to hand ownership to the requester, but the requester won't recognize that and things will break. Steve

On Mon, Nov 22, 2010 at 12:00 PM, Gabe Black <gbl...@eecs.umich.edu> wrote: What happens if an entry is in the L1 but not the L2? Gabe

Ali Saidi wrote: Between the l1 and l2 caches seems like a good place to me. The caches can cache page table entries; otherwise a tlb miss would be even more expensive than it is. The l1 isn't normally used for such things, since it would get polluted (look at why sparc has a "load 128 bits from l2, do not allocate into l1" instruction). Ali

On Nov 22, 2010, at 4:27 AM, Gabe Black wrote: For anybody waiting for an x86 FS regression (yes, I know, you can all hardly wait, but don't let this spoil your Thanksgiving): I'm getting closer to having it working, but I've discovered some issues with the mechanisms behind the --caches flag with fs.py and x86. I'm surprised I never thought to try it before. It also brings up some questions about where the table walkers should be hooked up in x86 and ARM. Currently it's after the L1, if any, but before the L2, if any, which seems wrong to me. Also caches don't seem to propagate requests upwards to the CPUs, which may or may not be an issue. I'm still looking into that. Gabe
Re: [m5-dev] X86 FS regression
IIRC, the filter works in conjunction with the address range autodetection stuff, so in order for a memory request to go across the bridge, the targeted address must lie on the other side *and* not be filtered out. I expect this explains why IPIs aren't going across. Thinking about it, I'm not sure why the I/O cache doesn't let uncached accesses through from the I/O side to the memory side, assuming the target exists on the memory side. CPU caches certainly let uncached accesses through, and it's the same cache module in both cases. Hmm, looking at fs.py, I think this line may be as much of a culprit as the others: test_sys.iocache = IOCache(addr_range=mem_size) I believe the address range exclusions are necessary to avoid an infinite loop between the iocache and the bridge in the address range autodetection algorithm, but perhaps the ranges are set up a little too conservatively, so that uncacheable addresses have no way through. I don't think it matters whether you open up the range in the iocache or in the bridge to let them through, as long as (1) you only do one and not the other and (2) it's selective enough that it doesn't include any PCI addresses that might result in a loop. Steve

On Tue, Nov 23, 2010 at 12:17 AM, Gabe Black <gbl...@eecs.umich.edu> wrote: I see that the bridge and cache are in parallel like you're describing. The culprit seems to be this line: configs/example/fs.py: test_sys.bridge.filter_ranges_a=[AddrRange(0, Addr.max)] where the bridge is being told explicitly not to let anything through from the IO side to the memory side. That should be fairly straightforward to poke a hole in for the necessary ranges. The corresponding line for the other direction (below) brings up another question: what happens if the bridge doesn't disallow something from going across and something else wants to respond to an address? The bridge isn't set to ignore APIC messages implementing IPIs between CPUs, but those seem to be going between CPUs and not out into the IO system. Are we just getting lucky? This same thing would seem to apply to any other memory-side object that isn't in the address range 0-mem_size. configs/example/fs.py: test_sys.bridge.filter_ranges_b=[AddrRange(mem_size)] Gabe

Steve Reinhardt wrote: I believe the I/O cache is normally paired with a bridge that lets things flow in the other direction. It's really just designed to handle accesses to cacheable space from devices on the I/O bus without requiring each device to have a cache. It's possible we've never had a situation before where I/O devices issue accesses to uncacheable non-memory locations on the CPU side of the I/O cache, in which case I would not be terribly surprised if that didn't quite work. Steve

On Mon, Nov 22, 2010 at 11:59 AM, Gabe Black <gbl...@eecs.umich.edu> wrote: The cache claims to support all addresses on the CPU side (or so say the comments), but no addresses on the memory side. Messages going from the IO interrupt controller get to the IO bus but then don't know where to go, since the IO cache hides the fact that the CPU interrupt controller wants to receive messages on that address range. I also don't know if the cache can handle messages passing through originating from the memory side, but I didn't look into that. Gabe

Ali Saidi wrote: Something has to maintain i/o coherency, and that something looks a whole lot like a couple-line cache. Why is having a cache there an issue? They should pass right through the cache. Ali

On Nov 22, 2010, at 4:42 AM, Gabe Black wrote: Hmm. It looks like this IO cache is only added when there are caches in the system (a fix for some coherency something? I sort of remember that discussion), and that wouldn't propagate to the IO bus the fact that the CPU's local APIC wanted to receive interrupt messages passed over the memory system. I don't know the intricacies of why the IO cache was necessary, or what problems passing requests back up through the cache might cause, but this is a serious issue for x86 and any other ISA that wants to move to a message-based interrupt scheme. I suppose the interrupt objects could be connected all the way out onto the IO bus itself, bypassing that cache, but I'm not sure how realistic that is.

Gabe Black wrote: For anybody waiting for an x86 FS regression (yes, I know, you can all hardly wait, but don't let this spoil your Thanksgiving): I'm getting closer to having it working, but I've discovered some issues with the mechanisms behind the --caches flag with fs.py and x86. I'm surprised I never thought to try it before.
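Making Steve's two constraints concrete, a sketch that opens only the bridge side, and only for a narrow window (the bounds are placeholders; the iocache line is unchanged from fs.py):

    # (1) Open only ONE of iocache/bridge for the range, or the address
    #     range autodetection can loop between them.
    test_sys.iocache = IOCache(addr_range=mem_size)   # left as-is
    # (2) Keep the hole narrow enough to exclude PCI space.
    #     uncacheable_start and uncacheable_end are placeholder bounds
    #     for the one uncacheable target that must be reachable.
    test_sys.bridge.filter_ranges_a = [
        AddrRange(0, uncacheable_start - 1),
        AddrRange(uncacheable_end + 1, Addr.max),
    ]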
Re: [m5-dev] X86 FS regression
The point is that connecting between the L1 and L2 induces the same problems wrt the L1 that connecting directly to memory induces wrt the whole cache hierarchy. You're just statistically more likely to get away with it in the former case because the L1 is smaller. Steve

On Tue, Nov 23, 2010 at 7:16 AM, Ali Saidi <sa...@umich.edu> wrote: Where are you connecting the table walker? If it's between the l1 and l2, my guess is that it will work. If it is connected to the memory bus, yes, memory is just responding without the help of a cache, and this could be the reason. Ali

On Tue, 23 Nov 2010 06:29:20 -0500, Gabe Black <gbl...@eecs.umich.edu> wrote: I think I may have just now. I've fixed a few issues, and am now getting to the point where something that should be in the page tables is causing a page fault. I found where the table walker is walking the tables for this particular access, and the last-level entry is all 0s. There could be a number of reasons this is all 0s, but since the main difference, other than timing, between this and a working configuration is the presence of caches, and we've identified a potential issue there, I'm inclined to suspect the actual page table entry is still in the L1 and hasn't been evicted out to memory yet. To fix this, is the best solution to add a bus below the CPU for all the connections that need to go to the L1? I'm assuming they'd all go into the dcache, since they're more data-ey and that keeps the icache read-only (ignoring SMC issues), and the dcache is probably servicing lower bandwidth normally. It also seems a little strange that this type of configuration is going on in the BaseCPU.py SimObject python file and not a configuration file, but I could be convinced there's a reason. Even if this isn't really a fix or the right thing to do, I'd still like to try it temporarily, at least to see if it corrects the problem I'm seeing. Gabe

Ali Saidi wrote: I haven't seen any strange behavior yet. That isn't to say it's not going to cause an issue in the future, but we've taken many a tlb miss and it hasn't fallen over yet. Ali

On Mon, 22 Nov 2010 13:08:13 -0800, Steve Reinhardt <ste...@gmail.com> wrote: Yeah, I just got around to reading this thread, and that was the point I was going to make... the L1 cache effectively serves as a translator between the CPU's word-size read/write requests and the coherent block-level requests that get snooped. If you attach a CPU-like device (such as the table walker) directly to an L2, the CPU-like accesses that go to the L2 will get sent to the L1s, but I'm not sure they'll be handled correctly. Not that they fundamentally couldn't; this just isn't a configuration we test, so it's likely that there are problems... for example, the L1 may try to hand ownership to the requester, but the requester won't recognize that and things will break. Steve

On Mon, Nov 22, 2010 at 12:00 PM, Gabe Black <gbl...@eecs.umich.edu> wrote: What happens if an entry is in the L1 but not the L2? Gabe

Ali Saidi wrote: Between the l1 and l2 caches seems like a good place to me. The caches can cache page table entries; otherwise a tlb miss would be even more expensive than it is. The l1 isn't normally used for such things, since it would get polluted (look at why sparc has a "load 128 bits from l2, do not allocate into l1" instruction). Ali

On Nov 22, 2010, at 4:27 AM, Gabe Black wrote: For anybody waiting for an x86 FS regression (yes, I know, you can all hardly wait, but don't let this spoil your Thanksgiving): I'm getting closer to having it working, but I've discovered some issues with the mechanisms behind the --caches flag with fs.py and x86. I'm surprised I never thought to try it before. It also brings up some questions about where the table walkers should be hooked up in x86 and ARM. Currently it's after the L1, if any, but before the L2, if any, which seems wrong to me. Also caches don't seem to propagate requests upwards to the CPUs, which may or may not be an issue. I'm still looking into that. Gabe
Re: [m5-dev] X86 FS regression
I definitely agree that putting a bus between the CPU and L1 and plugging the table walker in there is the best way to figure out if this is really the problem (and I expect it is). I'm not sure if it's the long-term right answer or not. We also need to consider how this works with Ruby. Steve

On Tue, Nov 23, 2010 at 3:29 AM, Gabe Black <gbl...@eecs.umich.edu> wrote: I think I may have just now. I've fixed a few issues, and am now getting to the point where something that should be in the page tables is causing a page fault. I found where the table walker is walking the tables for this particular access, and the last-level entry is all 0s. There could be a number of reasons this is all 0s, but since the main difference, other than timing, between this and a working configuration is the presence of caches, and we've identified a potential issue there, I'm inclined to suspect the actual page table entry is still in the L1 and hasn't been evicted out to memory yet. To fix this, is the best solution to add a bus below the CPU for all the connections that need to go to the L1? I'm assuming they'd all go into the dcache, since they're more data-ey and that keeps the icache read-only (ignoring SMC issues), and the dcache is probably servicing lower bandwidth normally. It also seems a little strange that this type of configuration is going on in the BaseCPU.py SimObject python file and not a configuration file, but I could be convinced there's a reason. Even if this isn't really a fix or the right thing to do, I'd still like to try it temporarily, at least to see if it corrects the problem I'm seeing. Gabe

Ali Saidi wrote: I haven't seen any strange behavior yet. That isn't to say it's not going to cause an issue in the future, but we've taken many a tlb miss and it hasn't fallen over yet. Ali

On Mon, 22 Nov 2010 13:08:13 -0800, Steve Reinhardt <ste...@gmail.com> wrote: Yeah, I just got around to reading this thread, and that was the point I was going to make... the L1 cache effectively serves as a translator between the CPU's word-size read/write requests and the coherent block-level requests that get snooped. If you attach a CPU-like device (such as the table walker) directly to an L2, the CPU-like accesses that go to the L2 will get sent to the L1s, but I'm not sure they'll be handled correctly. Not that they fundamentally couldn't; this just isn't a configuration we test, so it's likely that there are problems... for example, the L1 may try to hand ownership to the requester, but the requester won't recognize that and things will break. Steve

On Mon, Nov 22, 2010 at 12:00 PM, Gabe Black <gbl...@eecs.umich.edu> wrote: What happens if an entry is in the L1 but not the L2? Gabe

Ali Saidi wrote: Between the l1 and l2 caches seems like a good place to me. The caches can cache page table entries; otherwise a tlb miss would be even more expensive than it is. The l1 isn't normally used for such things, since it would get polluted (look at why sparc has a "load 128 bits from l2, do not allocate into l1" instruction). Ali

On Nov 22, 2010, at 4:27 AM, Gabe Black wrote: For anybody waiting for an x86 FS regression (yes, I know, you can all hardly wait, but don't let this spoil your Thanksgiving): I'm getting closer to having it working, but I've discovered some issues with the mechanisms behind the --caches flag with fs.py and x86. I'm surprised I never thought to try it before. It also brings up some questions about where the table walkers should be hooked up in x86 and ARM. Currently it's after the L1, if any, but before the L2, if any, which seems wrong to me. Also caches don't seem to propagate requests upwards to the CPUs, which may or may not be an issue. I'm still looking into that. Gabe
Re: [m5-dev] X86 FS regression
Does it? Shouldn't the l2 receive the request, ask for the block, and end up snooping the l1s? Ali

On Tue, 23 Nov 2010 07:30:00 -0800, Steve Reinhardt wrote: The point is that connecting between the L1 and L2 induces the same problems wrt the L1 that connecting directly to memory induces wrt the whole cache hierarchy. You're just statistically more likely to get away with it in the former case because the L1 is smaller. Steve

On Tue, Nov 23, 2010 at 7:16 AM, Ali Saidi wrote: Where are you connecting the table walker? If it's between the l1 and l2, my guess is that it will work. If it is connected to the memory bus, yes, memory is just responding without the help of a cache, and this could be the reason. Ali

On Tue, 23 Nov 2010 06:29:20 -0500, Gabe Black wrote: I think I may have just now. I've fixed a few issues, and am now getting to the point where something that should be in the page tables is causing a page fault. I found where the table walker is walking the tables for this particular access, and the last-level entry is all 0s. There could be a number of reasons this is all 0s, but since the main difference, other than timing, between this and a working configuration is the presence of caches, and we've identified a potential issue there, I'm inclined to suspect the actual page table entry is still in the L1 and hasn't been evicted out to memory yet. To fix this, is the best solution to add a bus below the CPU for all the connections that need to go to the L1? I'm assuming they'd all go into the dcache, since they're more data-ey and that keeps the icache read-only (ignoring SMC issues), and the dcache is probably servicing lower bandwidth normally. It also seems a little strange that this type of configuration is going on in the BaseCPU.py SimObject python file and not a configuration file, but I could be convinced there's a reason. Even if this isn't really a fix or the right thing to do, I'd still like to try it temporarily, at least to see if it corrects the problem I'm seeing. Gabe

Ali Saidi wrote: I haven't seen any strange behavior yet. That isn't to say it's not going to cause an issue in the future, but we've taken many a tlb miss and it hasn't fallen over yet. Ali

On Mon, 22 Nov 2010 13:08:13 -0800, Steve Reinhardt wrote: Yeah, I just got around to reading this thread, and that was the point I was going to make... the L1 cache effectively serves as a translator between the CPU's word-size read/write requests and the coherent block-level requests that get snooped. If you attach a CPU-like device (such as the table walker) directly to an L2, the CPU-like accesses that go to the L2 will get sent to the L1s, but I'm not sure they'll be handled correctly. Not that they fundamentally couldn't; this just isn't a configuration we test, so it's likely that there are problems... for example, the L1 may try to hand ownership to the requester, but the requester won't recognize that and things will break. Steve

On Mon, Nov 22, 2010 at 12:00 PM, Gabe Black wrote: What happens if an entry is in the L1 but not the L2? Gabe

Ali Saidi wrote: Between the l1 and l2 caches seems like a good place to me. The caches can cache page table entries; otherwise a tlb miss would be even more expensive than it is. The l1 isn't normally used for such things, since it would get polluted (look at why sparc has a "load 128 bits from l2, do not allocate into l1" instruction). Ali

On Nov 22, 2010, at 4:27 AM, Gabe Black wrote: For anybody waiting for an x86 FS regression (yes, I know, you can all hardly wait, but don't let this spoil your Thanksgiving): I'm getting closer to having it working, but I've discovered some issues with the mechanisms behind the --caches flag with fs.py and x86. I'm surprised I never thought to try it before. It also brings up some questions about where the table walkers should be hooked up in x86 and ARM. Currently it's after the L1, if any, but before the L2, if any, which seems wrong to me. Also caches don't seem to propagate requests upwards to the CPUs, which may or may not be an issue. I'm still looking into that. Gabe
Re: [m5-dev] X86 FS regression
No, when the L2 receives a request, it assumes the L1s above it have already been snooped, which is true since the request came in on the bus that the L1s snoop. The issue is that caches don't necessarily behave correctly when non-cache-block requests come in through their mem-side (snoop) port rather than their cpu-side (request) port. I'm guessing this could be made to work; I'd just be very surprised if it does right now, since the caches weren't designed to deal with this case and aren't tested this way. Steve

On Tue, Nov 23, 2010 at 7:50 AM, Ali Saidi <sa...@umich.edu> wrote: Does it? Shouldn't the l2 receive the request, ask for the block, and end up snooping the l1s? Ali

On Tue, 23 Nov 2010 07:30:00 -0800, Steve Reinhardt <ste...@gmail.com> wrote: The point is that connecting between the L1 and L2 induces the same problems wrt the L1 that connecting directly to memory induces wrt the whole cache hierarchy. You're just statistically more likely to get away with it in the former case because the L1 is smaller. Steve

On Tue, Nov 23, 2010 at 7:16 AM, Ali Saidi <sa...@umich.edu> wrote: Where are you connecting the table walker? If it's between the l1 and l2, my guess is that it will work. If it is connected to the memory bus, yes, memory is just responding without the help of a cache, and this could be the reason. Ali

On Tue, 23 Nov 2010 06:29:20 -0500, Gabe Black <gbl...@eecs.umich.edu> wrote: I think I may have just now. I've fixed a few issues, and am now getting to the point where something that should be in the page tables is causing a page fault. I found where the table walker is walking the tables for this particular access, and the last-level entry is all 0s. There could be a number of reasons this is all 0s, but since the main difference, other than timing, between this and a working configuration is the presence of caches, and we've identified a potential issue there, I'm inclined to suspect the actual page table entry is still in the L1 and hasn't been evicted out to memory yet. To fix this, is the best solution to add a bus below the CPU for all the connections that need to go to the L1? I'm assuming they'd all go into the dcache, since they're more data-ey and that keeps the icache read-only (ignoring SMC issues), and the dcache is probably servicing lower bandwidth normally. It also seems a little strange that this type of configuration is going on in the BaseCPU.py SimObject python file and not a configuration file, but I could be convinced there's a reason. Even if this isn't really a fix or the right thing to do, I'd still like to try it temporarily, at least to see if it corrects the problem I'm seeing. Gabe

Ali Saidi wrote: I haven't seen any strange behavior yet. That isn't to say it's not going to cause an issue in the future, but we've taken many a tlb miss and it hasn't fallen over yet. Ali

On Mon, 22 Nov 2010 13:08:13 -0800, Steve Reinhardt <ste...@gmail.com> wrote: Yeah, I just got around to reading this thread, and that was the point I was going to make... the L1 cache effectively serves as a translator between the CPU's word-size read/write requests and the coherent block-level requests that get snooped. If you attach a CPU-like device (such as the table walker) directly to an L2, the CPU-like accesses that go to the L2 will get sent to the L1s, but I'm not sure they'll be handled correctly. Not that they fundamentally couldn't; this just isn't a configuration we test, so it's likely that there are problems... for example, the L1 may try to hand ownership to the requester, but the requester won't recognize that and things will break. Steve

On Mon, Nov 22, 2010 at 12:00 PM, Gabe Black <gbl...@eecs.umich.edu> wrote: What happens if an entry is in the L1 but not the L2? Gabe

Ali Saidi wrote: Between the l1 and l2 caches seems like a good place to me. The caches can cache page table entries; otherwise a tlb miss would be even more expensive than it is. The l1 isn't normally used for such things, since it would get polluted (look at why sparc has a "load 128 bits from l2, do not allocate into l1" instruction). Ali

On Nov 22, 2010, at 4:27 AM, Gabe Black wrote: For anybody waiting for an x86 FS regression (yes, I know, you can all hardly wait, but don't let this spoil your Thanksgiving): I'm getting closer to having it working, but I've discovered some issues with the mechanisms behind the --caches flag with fs.py and x86. I'm surprised I never thought to try it before. It also brings up some questions about where the table walkers should be hooked up in x86 and ARM. Currently it's after the L1, if any, but before the L2, if any, which seems wrong to me. Also caches don't seem to propagate requests upwards to the CPUs, which may or may not be an issue. I'm still looking into that.
Re: [m5-dev] X86 FS regression
And even though I do think it could be made to work, I'm not sure it would be easy or a good idea. There are a lot of corner cases to worry about, especially for writes, since you'd have to actually buffer the write data somewhere, as opposed to just remembering that so-and-so has requested an exclusive copy. Actually, as I think about it, that might be the case that's breaking now... if the L1 has an exclusive copy and then it snoops a write (and not a read-exclusive), I'm guessing it will just invalidate its copy, losing the modifications. I wouldn't be terribly surprised if reads are working OK (the L1 should snoop those and respond if it's the owner), and of course it's all OK if the L1 doesn't have a copy of the block. So maybe there is a relatively easy way to make this work, but figuring out whether that's true and then testing it is still a non-trivial amount of effort. Steve

On Tue, Nov 23, 2010 at 7:57 AM, Steve Reinhardt <ste...@gmail.com> wrote: No, when the L2 receives a request, it assumes the L1s above it have already been snooped, which is true since the request came in on the bus that the L1s snoop. The issue is that caches don't necessarily behave correctly when non-cache-block requests come in through their mem-side (snoop) port rather than their cpu-side (request) port. I'm guessing this could be made to work; I'd just be very surprised if it does right now, since the caches weren't designed to deal with this case and aren't tested this way. Steve

On Tue, Nov 23, 2010 at 7:50 AM, Ali Saidi <sa...@umich.edu> wrote: Does it? Shouldn't the l2 receive the request, ask for the block, and end up snooping the l1s? Ali

On Tue, 23 Nov 2010 07:30:00 -0800, Steve Reinhardt <ste...@gmail.com> wrote: The point is that connecting between the L1 and L2 induces the same problems wrt the L1 that connecting directly to memory induces wrt the whole cache hierarchy. You're just statistically more likely to get away with it in the former case because the L1 is smaller. Steve

On Tue, Nov 23, 2010 at 7:16 AM, Ali Saidi <sa...@umich.edu> wrote: Where are you connecting the table walker? If it's between the l1 and l2, my guess is that it will work. If it is connected to the memory bus, yes, memory is just responding without the help of a cache, and this could be the reason. Ali

On Tue, 23 Nov 2010 06:29:20 -0500, Gabe Black <gbl...@eecs.umich.edu> wrote: I think I may have just now. I've fixed a few issues, and am now getting to the point where something that should be in the page tables is causing a page fault. I found where the table walker is walking the tables for this particular access, and the last-level entry is all 0s. There could be a number of reasons this is all 0s, but since the main difference, other than timing, between this and a working configuration is the presence of caches, and we've identified a potential issue there, I'm inclined to suspect the actual page table entry is still in the L1 and hasn't been evicted out to memory yet. To fix this, is the best solution to add a bus below the CPU for all the connections that need to go to the L1? I'm assuming they'd all go into the dcache, since they're more data-ey and that keeps the icache read-only (ignoring SMC issues), and the dcache is probably servicing lower bandwidth normally. It also seems a little strange that this type of configuration is going on in the BaseCPU.py SimObject python file and not a configuration file, but I could be convinced there's a reason. Even if this isn't really a fix or the right thing to do, I'd still like to try it temporarily, at least to see if it corrects the problem I'm seeing. Gabe

Ali Saidi wrote: I haven't seen any strange behavior yet. That isn't to say it's not going to cause an issue in the future, but we've taken many a tlb miss and it hasn't fallen over yet. Ali

On Mon, 22 Nov 2010 13:08:13 -0800, Steve Reinhardt <ste...@gmail.com> wrote: Yeah, I just got around to reading this thread, and that was the point I was going to make... the L1 cache effectively serves as a translator between the CPU's word-size read/write requests and the coherent block-level requests that get snooped. If you attach a CPU-like device (such as the table walker) directly to an L2, the CPU-like accesses that go to the L2 will get sent to the L1s, but I'm not sure they'll be handled correctly. Not that they fundamentally couldn't; this just isn't a configuration we test, so it's likely that there are problems... for example, the L1 may try to hand ownership to the requester, but the requester won't recognize that and things will break. Steve

On Mon, Nov 22, 2010 at 12:00 PM, Gabe Black <gbl...@eecs.umich.edu> wrote: What happens if an entry is in the L1 but not the L2? Gabe

Ali Saidi wrote: Between the l1 and l2 caches seems like a good place to me. The caches can cache page table entries; otherwise a tlb miss would be even more expensive than it is.
Re: [m5-dev] X86 FS regression
So what is the relatively good way to make this work in the short term? A bus? What about the slightly better version? I suppose a small cache might be ok and probably somewhat realistic. Thanks, Ali

On Tue, 23 Nov 2010 08:15:01 -0800, Steve Reinhardt wrote: And even though I do think it could be made to work, I'm not sure it would be easy or a good idea. There are a lot of corner cases to worry about, especially for writes, since you'd have to actually buffer the write data somewhere, as opposed to just remembering that so-and-so has requested an exclusive copy. Actually, as I think about it, that might be the case that's breaking now... if the L1 has an exclusive copy and then it snoops a write (and not a read-exclusive), I'm guessing it will just invalidate its copy, losing the modifications. I wouldn't be terribly surprised if reads are working OK (the L1 should snoop those and respond if it's the owner), and of course it's all OK if the L1 doesn't have a copy of the block. So maybe there is a relatively easy way to make this work, but figuring out whether that's true and then testing it is still a non-trivial amount of effort. Steve

On Tue, Nov 23, 2010 at 7:57 AM, Steve Reinhardt wrote: No, when the L2 receives a request, it assumes the L1s above it have already been snooped, which is true since the request came in on the bus that the L1s snoop. The issue is that caches don't necessarily behave correctly when non-cache-block requests come in through their mem-side (snoop) port rather than their cpu-side (request) port. I'm guessing this could be made to work; I'd just be very surprised if it does right now, since the caches weren't designed to deal with this case and aren't tested this way. Steve

On Tue, Nov 23, 2010 at 7:50 AM, Ali Saidi wrote: Does it? Shouldn't the l2 receive the request, ask for the block, and end up snooping the l1s? Ali

On Tue, 23 Nov 2010 07:30:00 -0800, Steve Reinhardt wrote: The point is that connecting between the L1 and L2 induces the same problems wrt the L1 that connecting directly to memory induces wrt the whole cache hierarchy. You're just statistically more likely to get away with it in the former case because the L1 is smaller. Steve

On Tue, Nov 23, 2010 at 7:16 AM, Ali Saidi wrote: Where are you connecting the table walker? If it's between the l1 and l2, my guess is that it will work. If it is connected to the memory bus, yes, memory is just responding without the help of a cache, and this could be the reason. Ali

On Tue, 23 Nov 2010 06:29:20 -0500, Gabe Black wrote: I think I may have just now. I've fixed a few issues, and am now getting to the point where something that should be in the page tables is causing a page fault. I found where the table walker is walking the tables for this particular access, and the last-level entry is all 0s. There could be a number of reasons this is all 0s, but since the main difference, other than timing, between this and a working configuration is the presence of caches, and we've identified a potential issue there, I'm inclined to suspect the actual page table entry is still in the L1 and hasn't been evicted out to memory yet. To fix this, is the best solution to add a bus below the CPU for all the connections that need to go to the L1? I'm assuming they'd all go into the dcache, since they're more data-ey and that keeps the icache read-only (ignoring SMC issues), and the dcache is probably servicing lower bandwidth normally. It also seems a little strange that this type of configuration is going on in the BaseCPU.py SimObject python file and not a configuration file, but I could be convinced there's a reason. Even if this isn't really a fix or the right thing to do, I'd still like to try it temporarily, at least to see if it corrects the problem I'm seeing. Gabe

Ali Saidi wrote: I haven't seen any strange behavior yet. That isn't to say it's not going to cause an issue in the future, but we've taken many a tlb miss and it hasn't fallen over yet. Ali

On Mon, 22 Nov 2010 13:08:13 -0800, Steve Reinhardt wrote: Yeah, I just got around to reading this thread, and that was the point I was going to make... the L1 cache effectively serves as a translator between the CPU's word-size read/write requests and the coherent block-level requests that get snooped. If you attach a CPU-like device (such as the table walker) directly to an L2, the CPU-like accesses that go to the L2 will get sent to the L1s, but I'm not sure they'll be handled correctly. Not that they fundamentally couldn't; this just isn't a configuration we test, so it's likely that there are problems... for example, the L1 may try to hand ownership to the requester, but the requester won't recognize that and things will break. Steve

On Mon, Nov 22, 2010 at 12:00 PM, Gabe Black wrote: What happens if an entry is in the L1 but not the L2? Gabe

Ali Saidi wrote: Between the l1 and l2 caches seems like a good place to me.
Re: [m5-dev] X86 FS regression
I think the two easy (python-only) solutions are sharing the existing L1 via a bus and tacking on a small L1 to the walker. Which one is more realistic would depend on what you're trying to model. Steve

On Tue, Nov 23, 2010 at 8:23 AM, Ali Saidi sa...@umich.edu wrote: So what is the relatively good way to make this work in the short term? A bus? What about the slightly better version? I suppose a small cache might be ok and probably somewhat realistic. Thanks, Ali

On Tue, 23 Nov 2010 08:15:01 -0800, Steve Reinhardt ste...@gmail.com wrote: And even though I do think it could be made to work, I'm not sure it would be easy or a good idea. There are a lot of corner cases to worry about, especially for writes, since you'd have to actually buffer the write data somewhere, as opposed to just remembering that so-and-so has requested an exclusive copy. Actually, as I think about it, that might be the case that's breaking now... if the L1 has an exclusive copy and then it snoops a write (and not a read-exclusive), I'm guessing it will just invalidate its copy, losing the modifications. I wouldn't be terribly surprised if reads are working OK (the L1 should snoop those and respond if it's the owner), and of course it's all OK if the L1 doesn't have a copy of the block. So maybe there is a relatively easy way to make this work, but figuring out whether that's true and then testing it is still a non-trivial amount of effort. Steve

On Tue, Nov 23, 2010 at 7:57 AM, Steve Reinhardt ste...@gmail.com wrote: No, when the L2 receives a request it assumes the L1s above it have already been snooped, which is true since the request came in on the bus that the L1s snoop. The issue is that caches don't necessarily behave correctly when non-cache-block requests come in through their mem-side (snoop) port and not through their cpu-side (request) port. I'm guessing this could be made to work, I'd just be very surprised if it does right now, since the caches weren't designed to deal with this case and aren't tested this way. Steve

On Tue, Nov 23, 2010 at 7:50 AM, Ali Saidi sa...@umich.edu wrote: Does it? Shouldn't the l2 receive the request, ask for the block, and end up snooping the l1s? Ali

On Tue, 23 Nov 2010 07:30:00 -0800, Steve Reinhardt ste...@gmail.com wrote: The point is that connecting between the L1 and L2 induces the same problems wrt the L1 that connecting directly to memory induces wrt the whole cache hierarchy. You're just statistically more likely to get away with it in the former case because the L1 is smaller. Steve

On Tue, Nov 23, 2010 at 7:16 AM, Ali Saidi sa...@umich.edu wrote: Where are you connecting the table walker? If it's between the l1 and l2, my guess is that it will work. If it's to the memory bus, then yes, memory is just responding without the help of a cache, and this could be the reason. Ali

On Tue, 23 Nov 2010 06:29:20 -0500, Gabe Black gbl...@eecs.umich.edu wrote: I think I may have just now. I've fixed a few issues, and am now getting to the point where something that should be in the page tables is causing a page fault. I found where the table walker is walking the tables for this particular access, and the last-level entry is all 0s. There could be a number of reasons this is all 0s, but since the main difference (other than timing) between this and a working configuration is the presence of caches, and we've identified a potential issue there, I'm inclined to suspect the actual page table entry is still in the L1 and hasn't been evicted out to memory yet. To fix this, is the best solution to add a bus below the CPU for all the connections that need to go to the L1? I'm assuming they'd all go into the dcache, since they're more data-ey and that keeps the icache read-only (ignoring SMC issues), and the dcache is probably servicing lower bandwidth normally. It also seems a little strange that this type of configuration is going on in the BaseCPU.py SimObject python file and not a configuration file, but I could be convinced there's a reason. Even if this isn't really a fix or the right thing to do, I'd still like to try it temporarily, at least to see if it corrects the problem I'm seeing. Gabe
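For concreteness, here is a rough sketch of what the bus-sharing option Steve suggests could look like as a config fragment. This is only illustrative: it assumes cpu, dcache, and l2_bus objects already exist, and the walker port attribute (dtb.walker.port) and the old-style Bus vector port are guesses that may not match the actual tree.

from m5.objects import *

# Sketch: multiplex the CPU's data port and the table walker onto the
# single cpu_side port of the existing L1 dcache via a small bus.
dcache_bus = Bus()
cpu.dcache_port = dcache_bus.port        # CPU data accesses
cpu.dtb.walker.port = dcache_bus.port    # walker accesses (assumed name)
dcache.cpu_side = dcache_bus.port        # shared L1 sits below the bus
dcache.mem_side = l2_bus.port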
Re: [m5-dev] X86 FS regression
Of these, I think the walker cache sounds better for two reasons. First, it avoids the L1 pollution Ali was talking about, and second, a new bus would add mostly inert stuff on the way to memory, and would involve looking up which port to use even though it'd always be the same one. I'll give that a try. Gabe

Steve Reinhardt wrote: I think the two easy (python-only) solutions are sharing the existing L1 via a bus and tacking on a small L1 to the walker. Which one is more realistic would depend on what you're trying to model. Steve
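Concretely, the walker-cache option Gabe prefers might look something like the fragment below. The size, associativity, and latency parameters are placeholders, and the walker port name is an assumption, not what any eventual patch does:

from m5.objects import *

# Sketch: a tiny dedicated cache in front of the table walker keeps
# page-table lines coherent with the hierarchy without polluting the L1s.
walker_cache = BaseCache(size='1kB', assoc=2, block_size=64,
                         latency='1ns', mshrs=4, tgts_per_mshr=8)
cpu.dtb.walker.port = walker_cache.cpu_side   # assumed walker port name
walker_cache.mem_side = l2_bus.port           # same bus the L1s sit on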
[m5-dev] X86 FS regression
For anybody waiting for an x86 FS regression (yes, I know, you can all hardly wait, but don't let this spoil your Thanksgiving) I'm getting closer to having it working, but I've discovered some issues with the mechanisms behind the --caches flag with fs.py and x86. I'm surprised I never thought to try it before. It also brings up some questions about where the table walkers should be hooked up in x86 and ARM. Currently it's after the L1, if any, but before the L2, if any, which seems wrong to me. Also caches don't seem to propagate requests upwards to the CPUs which may or may not be an issue. I'm still looking into that. Gabe
Re: [m5-dev] X86 FS regression
Hmm. It looks like this IO cache is only added when there are caches in the system (a fix for some coherency something? I sort of remember that discussion), and that wouldn't propagate to the IO bus the fact that the CPU's local APIC wanted to receive interrupt messages passed over the memory system. I don't know the intricacies of why the IO cache was necessary, or what problems passing requests back up through the cache might cause, but this is a serious issue for x86 and any other ISA that wants to move to a message-based interrupt scheme. I suppose the interrupt objects could be connected all the way out onto the IO bus itself, bypassing that cache, but I'm not sure how realistic that is. Gabe
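If that bypass were tried, the hookup might be as simple as the fragment below; iobus and the interrupt object's port attribute are stand-in names, since I don't know offhand what they're actually called:

# Sketch: connect the local APIC's message port straight to the IO bus,
# skipping the IO cache that hides its address range.
cpu.interrupts.int_port = iobus.port     # assumed attribute names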
Re: [m5-dev] X86 FS regression
Between the l1 and l2 caches seems like a good place to me. The caches can cache page table entries; otherwise a tlb miss would be even more expensive than it is. The l1 isn't normally used for such things since it would get polluted (look at why sparc has a 'load 128 bits from l2, do not allocate into l1' instruction). Ali

On Nov 22, 2010, at 4:27 AM, Gabe Black wrote: For anybody waiting for an x86 FS regression (yes, I know, you can all hardly wait, but don't let this spoil your Thanksgiving) I'm getting closer to having it working, but I've discovered some issues with the mechanisms behind the --caches flag with fs.py and x86.
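As I understand the current hookup, "between the l1 and l2" amounts to roughly the shape below. All names are assumed (icache, dcache, l2cache, and cpu are taken as already constructed), and the walker port attribute is a guess:

from m5.objects import *

# Sketch: the walker shares the L1-to-L2 bus, so its requests can hit
# page table entries cached in the L2 but never pass through the L1s.
l2_bus = Bus()
icache.mem_side = l2_bus.port
dcache.mem_side = l2_bus.port
cpu.dtb.walker.port = l2_bus.port   # assumed walker port name
l2cache.cpu_side = l2_bus.port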
Re: [m5-dev] X86 FS regression
Something has to maintain i/o coherency, and that something looks a whole lot like a couple-line cache. Why is having a cache there an issue? The requests should pass right through the cache. Ali

On Nov 22, 2010, at 4:42 AM, Gabe Black wrote: Hmm. It looks like this IO cache is only added when there are caches in the system, and that wouldn't propagate to the IO bus the fact that the CPU's local APIC wanted to receive interrupt messages passed over the memory system.
Re: [m5-dev] X86 FS regression
The cache claims to support all addresses on the CPU side (or so the comments say), but no addresses on the memory side. Messages going from the IO interrupt controller get to the IO bus but then don't know where to go, since the IO cache hides the fact that the CPU interrupt controller wants to receive messages on that address range. I also don't know if the cache can handle messages passing through that originate from the memory side, but I didn't look into that. Gabe

Ali Saidi wrote: Something has to maintain i/o coherency, and that something looks a whole lot like a couple-line cache.
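The failure is essentially a routing-table miss. A toy model of how a bus picks a destination port from advertised address ranges shows why the message has nowhere to go. This is pure illustration, not M5 code; the port names and device range are made up, though 0xFEE00000 is the usual x86 local APIC message address:

# Toy: a bus routes a request to whichever port advertised a matching
# address range. The IO cache advertises no ranges toward the IO bus,
# so interrupt messages aimed at the local APIC match nothing.
ranges = {
    'iocache': [],                    # hides everything behind it
    'uart': [(0x3F8, 0x3FF)],         # made-up device range
}

def route(addr):
    for port, rs in ranges.items():
        if any(lo <= addr <= hi for lo, hi in rs):
            return port
    raise LookupError("no port claims addr %#x" % addr)

try:
    route(0xFEE00000)                 # local APIC message range
except LookupError as e:
    print(e)                          # the message is undeliverable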
Re: [m5-dev] X86 FS regression
What happens if an entry is in the L1 but not the L2? Gabe

Ali Saidi wrote: Between the l1 and l2 caches seems like a good place to me. The caches can cache page table entries; otherwise a tlb miss would be even more expensive than it is.
Re: [m5-dev] X86 FS regression
I believe the I/O cache is normally paired with a bridge that lets things flow in the other direction. It's really just designed to handle accesses to cacheable space from devices on the I/O bus without requiring each device to have a cache. It's possible we've never had a situation before where I/O devices issue accesses to uncacheable non-memory locations on the CPU side of the I/O cache, in which case I would not be terribly surprised if that didn't quite work. Steve

On Mon, Nov 22, 2010 at 11:59 AM, Gabe Black gbl...@eecs.umich.edu wrote: The cache claims to support all addresses on the CPU side (or so the comments say), but no addresses on the memory side.
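For reference, that pairing looks roughly like the fragment below in the fs.py-style configs. The cache parameters and the old side_a/side_b bridge ports are from memory and may be off; iobus and membus are assumed to exist:

from m5.objects import *

# Sketch: device reads/writes to cacheable memory climb through a small
# coherent IO cache; CPU accesses to devices descend through a bridge.
iocache = BaseCache(size='1kB', assoc=8, block_size=64,
                    latency='10ns', mshrs=20, tgts_per_mshr=12)
iocache.cpu_side = iobus.port
iocache.mem_side = membus.port

bridge = Bridge(delay='50ns')
bridge.side_a = iobus.port    # faces the IO bus
bridge.side_b = membus.port   # faces the memory bus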
Re: [m5-dev] X86 FS regression
Yea, I just got around to reading this thread and that was the point I was going to make... the L1 cache effectively serves as a translator between the CPU's word-size read/write requests and the coherent block-level requests that get snooped. If you attach a CPU-like device (such as the table walker) directly to an L2, the CPU-like accesses that go to the L2 will get sent to the L1s, but I'm not sure they'll be handled correctly. Not that they fundamentally couldn't; this just isn't a configuration we test, so it's likely that there are problems... for example, the L1 may try to hand ownership to the requester, but the requester won't recognize that and things will break. Steve

On Mon, Nov 22, 2010 at 12:00 PM, Gabe Black gbl...@eecs.umich.edu wrote: What happens if an entry is in the L1 but not the L2? Gabe
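The translation Steve describes is, at its core, alignment of a small request to coherent block granularity (plus tracking the block's coherence state). A little illustrative arithmetic:

BLK = 64  # cache block size in bytes

def block_request(addr, size):
    # Align a word-size CPU request to the block that actually moves
    # through the coherent memory system.
    base = addr & ~(BLK - 1)
    assert addr + size <= base + BLK, "request must not straddle blocks"
    return base, BLK

base, size = block_request(0x1234, 8)   # an 8-byte walker read
print(hex(base), size)                  # 0x1200 64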
Re: [m5-dev] X86 FS regression
I haven't seen any strange behavior yet. That isn't to say it's not going to cause an issue in the future, but we've taken many a tlb miss and it hasn't fallen over yet. Ali

On Mon, 22 Nov 2010 13:08:13 -0800, Steve Reinhardt wrote: Yea, I just got around to reading this thread and that was the point I was going to make... the L1 cache effectively serves as a translator between the CPU's word-size read/write requests and the coherent block-level requests that get snooped.