Re: [m5-dev] X86 FS regression

2010-12-13 Thread Gabe Black
I finally got around to trying this out (patch attached) and it seemed
to fix x86. This change seems to break ARM_FS, though. It faults when it
tries to execute code at the fault vector because the page table entry
supposedly is marked no execute (I think). That makes the timing CPU
spin around and around because it keeps getting a fault, invoking it,
and attempting to fetch again. The call stack never hits a point where
it has to wait for an event, so it never collapses back down and
recurses until the stack is too big and M5 segfaults. It seems to just
get lost with the atomic CPU and I'm not entirely sure what's going on
there, although I suspect the atomic CPU is just structured differently
and doesn't recurse infinitely.

I wanted to ask the ARM folks if they knew what was going on here. Is
something about the page table walk supposed to be uncached but isn't?
This seems to work without that cache added in, so I suspect the walker
is picking up stale data or something.

Gabe
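A rough sketch of the kind of walker-cache hookup being discussed (not the attached patch), in the Python config style of the time. The class name, parameter values, and the itb/dtb walker port paths are assumptions for illustration only:

    from m5.objects import BaseCache

    # Hedged sketch: a tiny cache dedicated to the page table walker,
    # sitting between the walker's port and the bus below the L1s so walker
    # traffic doesn't pollute the L1s.  Names and sizes are illustrative.
    class PageTableWalkerCache(BaseCache):
        assoc = 2
        block_size = 64
        latency = '1ns'
        mshrs = 4
        size = '1kB'
        tgts_per_mshr = 8

    def connect_walker_caches(cpu, tol2bus):
        cpu.itb_walker_cache = PageTableWalkerCache()
        cpu.dtb_walker_cache = PageTableWalkerCache()
        # Walker ports feed the small caches...
        cpu.itb.walker.port = cpu.itb_walker_cache.cpu_side
        cpu.dtb.walker.port = cpu.dtb_walker_cache.cpu_side
        # ...and the caches sit on the same bus the L1s feed into.
        cpu.itb_walker_cache.mem_side = tol2bus.port
        cpu.dtb_walker_cache.mem_side = tol2bus.port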

Gabe Black wrote:
 Of these, I think the walker cache sounds better for two reasons. First,
 it avoids the L1 pollution Ali was talking about, and second, a new bus
 would add mostly inert stuff on the way to memory and would
 involve looking up what port to use even though it'd always be the same
 one. I'll give that a try.

 Gabe

 Steve Reinhardt wrote:
   
 I think the two easy (python-only) solutions are sharing the existing
 L1 via a bus and tacking on a small L1 to the walker.  Which one is
 more realistic would depend on what you're trying to model.

 Steve

 On Tue, Nov 23, 2010 at 8:23 AM, Ali Saidi sa...@umich.edu
 mailto:sa...@umich.edu wrote:

 So what is the relatively good way to make this work in the short
 term? A bus? What about the slightly better version? I suppose a
 small cache might be ok and probably somewhat realistic.

  

 Thanks,

 Ali

  

  

 On Tue, 23 Nov 2010 08:15:01 -0800, Steve Reinhardt
 ste...@gmail.com mailto:ste...@gmail.com wrote:

 
 And even though I do think it could be made to work, I'm not sure
 it would be easy or a good idea.  There are a lot of corner cases
 to worry about, especially for writes, since you'd have to
 actually buffer the write data somewhere as opposed to just
 remembering that so-and-so has requested an exclusive copy.

 Actually as I think about it, that might be the case that's
 breaking now... if the L1 has an exclusive copy and then it
 snoops a write (and not a read-exclusive), I'm guessing it will
 just invalidate its copy, losing the modifications.  I wouldn't
 be terribly surprised if reads are working OK (the L1 should
 snoop those and respond if it's the owner), and of course it's
 all OK if the L1 doesn't have a copy of the block.

 So maybe there is a relatively easy way to make this work, but
 figuring out whether that's true and then testing it is still a
 non-trivial amount of effort.

 Steve

 On Tue, Nov 23, 2010 at 7:57 AM, Steve Reinhardt
 ste...@gmail.com mailto:ste...@gmail.com wrote:

 No, when the L2 receives a request it assumes the L1s above
 it have already been snooped, which is true since the request
 came in on the bus that the L1s snoop.  The issue is that
 caches don't necessarily behave correctly when
 non-cache-block requests come in through their mem-side
 (snoop) port and not through their cpu-side (request) port. 
 I'm guessing this could be made to work, I'd just be very
 surprised if it does right now, since the caches weren't
 designed to deal with this case and aren't tested this way.

 Steve


 On Tue, Nov 23, 2010 at 7:50 AM, Ali Saidi sa...@umich.edu
 mailto:sa...@umich.edu wrote:

 Does it? Shouldn't the l2 receive the request, ask for
 the block and end up snooping the l1s?

  

 Ali

  

  

 On Tue, 23 Nov 2010 07:30:00 -0800, Steve Reinhardt
 ste...@gmail.com mailto:ste...@gmail.com wrote:

 The point is that connecting between the L1 and L2
 induces the same problems wrt the L1 that connecting
 directly to memory induces wrt the whole cache
 hierarchy.  You're just statistically more likely to
 get away with it in the former case because the L1 is
 smaller.

 Steve

 On Tue, Nov 23, 2010 at 7:16 AM, Ali Saidi
 sa...@umich.edu mailto:sa...@umich.edu wrote:


 Where are you connecting the table walker? If
 it's between the l1 and l2 my guess is that it
 will work. if it is to the memory bus, yes,
 memory is just responding without the help of a
 cache and this could be 

Re: [m5-dev] X86 FS regression

2010-12-13 Thread Ali Saidi
I've got a patch that gets closer to supporting caches between the TLB and L2 
cache, but it doesn't work. Since we don't have a way to invalidate addresses 
out of the cache if you switch a memory region to uncacheable, the address 
remains in the cache and causes all sorts of havoc. If you want to make some 
changes, please make them x86 only for now. We'll need to implement some form 
of cache cleaning or invalidating before it can be supported on ARM.

Thanks,
Ali
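One way the x86-only restriction could be expressed in the config Python, as a sketch; buildEnv is the usual way configs test the target ISA, and connect_walker_caches is the hypothetical helper sketched under Gabe's message above:

    from m5.defines import buildEnv

    # Hedged sketch: only hook up the walker caches for x86 until ARM has a
    # way to clean or invalidate cached page table entries when a region
    # becomes uncacheable.  'cpu' and 'system' here are placeholders.
    if buildEnv['TARGET_ISA'] == 'x86':
        connect_walker_caches(cpu, system.tol2bus)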

On Dec 13, 2010, at 4:56 AM, Gabe Black wrote:

 I finally got around to trying this out (patch attached) and it seemed
 to fix x86. This change seems to break ARM_FS, though. It faults when it
 tries to execute code at the fault vector because the page table entry
 supposedly is marked no execute (I think). That makes the timing CPU
 spin around and around because it keeps getting a fault, invoking it,
 and attempting to fetch again. The call stack never hits a point where
 it has to wait for an event, so it never collapses back down and
 recurses until the stack is too big and M5 segfaults. It seems to just
 get lost with the atomic CPU and I'm not entirely sure what's going on
 there, although I suspect the atomic CPU is just structured differently
 and doesn't recurse infinitely.
 
 I wanted to ask the ARM folks if they knew what was going on here. Is
 something about the page table walk supposed to be uncached but isn't?
 This seems to work without that cache added in, so I suspect the walker
 is picking up stale data or something.
 
 Gabe
 
 Gabe Black wrote:
 Of these, I think the walker cache sounds better for two reasons. First,
 it avoids the L1 pollution Ali was talking about, and second, a new bus
 would add mostly inert stuff on the way to memory and which would
 involve looking up what port to use even though it'd always be the same
 one. I'll give that a try.
 
 Gabe
 
 Steve Reinhardt wrote:
 
 I think the two easy (python-only) solutions are sharing the existing
 L1 via a bus and tacking on a small L1 to the walker.  Which one is
 more realistic would depend on what you're trying to model.
 
 Steve
 
 On Tue, Nov 23, 2010 at 8:23 AM, Ali Saidi sa...@umich.edu
 mailto:sa...@umich.edu wrote:
 
So what is the relatively good way to make this work in the short
term? A bus? What about the slightly better version? I suppose a
small cache might be ok and probably somewhat realistic.
 
 
 
Thanks,
 
Ali
 
 
 
 
 
On Tue, 23 Nov 2010 08:15:01 -0800, Steve Reinhardt
ste...@gmail.com mailto:ste...@gmail.com wrote:
 
 
And even though I do think it could be made to work, I'm not sure
it would be easy or a good idea.  There are a lot of corner cases
to worry about, especially for writes, since you'd have to
actually buffer the write data somewhere as opposed to just
remembering that so-and-so has requested an exclusive copy.
 
Actually as I think about it, that might be the case that's
breaking now... if the L1 has an exclusive copy and then it
snoops a write (and not a read-exclusive), I'm guessing it will
just invalidate its copy, losing the modifications.  I wouldn't
be terribly surprised if reads are working OK (the L1 should
snoop those and respond if it's the owner), and of course it's
all OK if the L1 doesn't have a copy of the block.
 
So maybe there is a relatively easy way to make this work, but
figuring out whether that's true and then testing it is still a
non-trivial amount of effort.
 
Steve
 
On Tue, Nov 23, 2010 at 7:57 AM, Steve Reinhardt
ste...@gmail.com mailto:ste...@gmail.com wrote:
 
No, when the L2 receives a request it assumes the L1s above
it have already been snooped, which is true since the request
came in on the bus that the L1s snoop.  The issue is that
caches don't necessarily behave correctly when
non-cache-block requests come in through their mem-side
(snoop) port and not through their cpu-side (request) port. 
I'm guessing this could be made to work, I'd just be very
surprised if it does right now, since the caches weren't
designed to deal with this case and aren't tested this way.
 
Steve
 
 
On Tue, Nov 23, 2010 at 7:50 AM, Ali Saidi sa...@umich.edu
mailto:sa...@umich.edu wrote:
 
Does it? Shouldn't the l2 receive the request, ask for
the block and end up snooping the l1s?
 
 
 
Ali
 
 
 
 
 
On Tue, 23 Nov 2010 07:30:00 -0800, Steve Reinhardt
ste...@gmail.com mailto:ste...@gmail.com wrote:
 
The point is that connecting between the L1 and L2
induces the same problems wrt the L1 that connecting
directly to memory induces wrt the whole cache
hierarchy.  You're just statistically more likely to
get away with it in the former case because the L1 is
smaller.
 
   

Re: [m5-dev] X86 FS regression

2010-12-06 Thread Steve Reinhardt
On Wed, Dec 1, 2010 at 3:07 PM, Ali Saidi sa...@umich.edu wrote:

 Continuing the e-mail thread that never dies

 It appears as though the dcache somehow does the correct thing when a read
 request comes into the l2 bus.  Note that the dcache is snooping the
 request.
 Listening for system connection on port 3456

  481100500: system.cpu.dtb.walker: Begining table walk for address
 0xc020, TTBCR: 0, bits:0
  481100500: system.cpu.dtb.walker:  - Selecting TTBR0
  481100500: system.cpu.dtb.walker:  - Descriptor at address 0x7008
 481100500: system.tol2bus: recvAtomic: packet src 4 dest -1 addr 0x7008 cmd
 ReadReq
 481100500: system.cpu.dcache: snooped a ReadReq request for addr 7000,
 responding, new state is 13
 481100500: system.l2: rcvd mem-inhibited ReadReq on 0x7008: not responding
 481100500: system.cpu.dtb.walker: L1 descriptor for 0xc020 is 0x20040e

 After some work I managed to get a cache to work in this case too... The
 table walker has to kick off a sendStatusChange(), otherwise the cache below
 it doesn't get added to the snooping list of the tol2bus.


I'm not too surprised that reads work, but what about writes (e.g., if the
TLB walker sets an accessed bit)?

Steve


Re: [m5-dev] X86 FS regression

2010-12-01 Thread Ali Saidi


Continuing the e-mail thread that never dies...

It appears as though the dcache somehow does the correct thing when a read
request comes into the l2 bus. Note that the dcache is snooping the request.

Listening for system connection on port 3456

 481100500: system.cpu.dtb.walker: Begining table walk for address 0xc020, TTBCR: 0, bits:0
 481100500: system.cpu.dtb.walker:  - Selecting TTBR0
 481100500: system.cpu.dtb.walker:  - Descriptor at address 0x7008
 481100500: system.tol2bus: recvAtomic: packet src 4 dest -1 addr 0x7008 cmd ReadReq
 481100500: system.cpu.dcache: snooped a ReadReq request for addr 7000, responding, new state is 13
 481100500: system.l2: rcvd mem-inhibited ReadReq on 0x7008: not responding
 481100500: system.cpu.dtb.walker: L1 descriptor for 0xc020 is 0x20040e

After some work I managed to get a cache to work in this case too... The
table walker has to kick off a sendStatusChange(), otherwise the cache
below it doesn't get added to the snooping list of the tol2bus.

Ali

On Tue, 23 Nov 2010 07:30:00 -0800, Steve Reinhardt wrote:

 The point is that connecting between the L1 and L2 induces the same problems
 wrt the L1 that connecting directly to memory induces wrt the whole cache
 hierarchy. You're just statistically more likely to get away with it in the
 former case because the L1 is smaller.

 Steve

 On Tue, Nov 23, 2010 at 7:16 AM, Ali Saidi wrote:

  Where are you connecting the table walker? If it's between the l1 and l2
  my guess is that it will work. if it is to the memory bus, yes, memory is
  just responding without the help of a cache and this could be the reason.

  Ali

  On Tue, 23 Nov 2010 06:29:20 -0500, Gabe Black wrote:

   I think I may have just now. I've fixed a few issues, and am now getting
   to the point where something that should be in the pagetables is causing
   a page fault. I found where the table walker is walking the tables for
   this particular access, and the last level entry is all 0s. There could
   be a number of reasons this is all 0s, but since the main difference
   other than timing between this and a working configuration is the
   presence of caches and we've identified a potential issue there, I'm
   inclined to suspect the actual page table entry is still in the L1 and
   hasn't been evicted out to memory yet.

   To fix this, is the best solution to add a bus below the CPU for all the
   connections that need to go to the L1? I'm assuming they'd all go into
   the dcache since they're more data-ey and that keeps the icache read
   only (ignoring SMC issues), and the dcache is probably servicing lower
   bandwidth normally. It also seems a little strange that this type of
   configuration is going on in the BaseCPU.py SimObject python file and
   not a configuration file, but I could be convinced there's a reason.
   Even if this isn't really a fix or the right thing to do, I'd still
   like to try it temporarily at least to see if it corrects the problem
   I'm seeing.

   Gabe

   Ali Saidi wrote:

    I haven't seen any strange behavior yet. That isn't to say it's not
    going to cause an issue in the future, but we've taken many a tlb miss
    and it hasn't fallen over yet.

    Ali

    On Mon, 22 Nov 2010 13:08:13 -0800, Steve Reinhardt wrote:

     Yea, I just got around to reading this thread and that was the point
     I was going to make... the L1 cache effectively serves as a
     translator between the CPU's word-size read & write requests and the
     coherent block-level requests that get snooped. If you attach a
     CPU-like device (such as the table walker) directly to an L2, the
     CPU-like accesses that go to the L2 will get sent to the L1s but I'm
     not sure they'll be handled correctly. Not that they fundamentally
     couldn't, this just isn't a configuration we test so it's likely that
     there are problems... for example, the L1 may try to hand ownership
     to the requester but the requester won't recognize that and things
     will break.

     Steve

     On Mon, Nov 22, 2010 at 12:00 PM, Gabe Black wrote:

      What happens if an entry is in the L1 but not the L2?

      Gabe

      Ali Saidi wrote:
       Between the l1 and l2 caches seems like a good place to me. The
       caches can cache page table entries, otherwise a tlb miss would
       be even more expensive than it is. The l1 isn't normally used for
       such things since it would get polluted (look why sparc has a
       load 128bits from l2, do not allocate into l1 instruction).

       Ali

       On Nov 22, 2010, at 4:27 AM, Gabe Black wrote:

        For anybody waiting for an x86 FS regression (yes, I know, you can
        all hardly wait, but don't let this spoil your Thanksgiving) I'm
        getting closer to having it working, but I've discovered some
        issues with the mechanisms behind the --caches flag with fs.py and
        x86. I'm surprised I never thought to try it before. It also brings
        up some questions about where the table walkers should be hooked up
        in x86 and ARM. Currently it's after the L1, if any, but before the
        L2, if any, which seems wrong to me. Also caches don't seem 

Re: [m5-dev] X86 FS regression

2010-11-23 Thread Gabe Black
I see that the bridge and cache are in parallel like you're describing.
The culprit seems to be this line:

configs/example/fs.py:test_sys.bridge.filter_ranges_a=[AddrRange(0,
Addr.max)]

where the bridge is being told explicitly not to let anything through
from the IO side to the memory side. That should be fairly
straightforward to poke a hole in for the necessary ranges. The
corresponding line for the other direction (below) brings up another
question. What happens if the bridge doesn't block something from going
across and something else wants to respond to that address? The bridge
isn't set to ignore APIC messages implementing IPIs between CPUs, but
those seem to be going between CPUs and not out into the IO system. Are
we just getting lucky? This same thing would seem to apply to any other
memory side object that isn't in the address range 0-mem_size.

configs/example/fs.py:   
test_sys.bridge.filter_ranges_b=[AddrRange(mem_size)]

Gabe
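As a sketch of the hole-poking, the single catch-all exclusion could be split into two ranges with a gap for whatever the CPU-side interrupt controllers listen on. The 0xFEE00000-0xFEEFFFFF window (the x86 local APIC space) is only an illustrative stand-in for "the necessary ranges", and the exact addresses M5 uses for interrupt messages would need to be checked:

    # Hedged sketch only; AddrRange and Addr are already in scope in fs.py.
    # Everything is still filtered except the example interrupt window.
    test_sys.bridge.filter_ranges_a = [AddrRange(0, 0xFEDFFFFF),
                                       AddrRange(0xFEF00000, Addr.max)]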

Steve Reinhardt wrote:
 I believe the I/O cache is normally paired with a bridge that lets
 things flow in the other direction.  It's really just designed to
 handle accesses to cacheable space from devices on the I/O bus without
 requiring each device to have a cache.  It's possible we've never had
 a situation before where I/O devices issue accesses to uncacheable
 non-memory locations on the CPU side of the I/O cache, in which case I
 would not be terribly surprised if that didn't quite work.

 Steve

 On Mon, Nov 22, 2010 at 11:59 AM, Gabe Black gbl...@eecs.umich.edu
 mailto:gbl...@eecs.umich.edu wrote:

 The cache claims to support all addresses on the CPU side (or so says
 the comments), but no addresses on the memory side. Messages going
 from
 the IO interrupt controller get to the IO bus but then don't know
 where
 to go since the IO cache hides the fact that the CPU interrupt
 controller wants to receive messages on that address range. I also
 don't
 know if the cache can handle messages passing through originating from
 the memory side, but I didn't look into that.

 Gabe

 Ali Saidi wrote:
   Something has to maintain i/o coherency and that something looks
  a whole lot like a couple-line cache. Why is having a cache there
  any issue? They should pass right through the cache.
 
  Ali
 
 
 
  On Nov 22, 2010, at 4:42 AM, Gabe Black wrote:
 
 
  Hmm. It looks like this IO cache is only added when there are
 caches in
  the system (a fix for some coherency something? I sort of
 remember that
  discussion.) and that wouldn't propagate to the IO bus the fact
 that the
  CPU's local APIC wanted to receive interrupt messages passed
 over the
  memory system. I don't know the intricacies of why the IO cache was
  necessary, or what problems passing requests back up through
 the cache
  might cause, but this is a serious issue for x86 and any other
 ISA that
  wants to move to a message based interrupt scheme. I suppose the
  interrupt objects could be connected all the way out onto the
 IO bus
  itself, bypassing that cache, but I'm not sure how realistic
 that is.
 
  Gabe Black wrote:
 
 For anybody waiting for an x86 FS regression (yes, I know,
 you can
  all hardly wait, but don't let this spoil your Thanksgiving)
 I'm getting
  closer to having it working, but I've discovered some issues
 with the
  mechanisms behind the --caches flag with fs.py and x86. I'm
 surprised I
  never thought to try it before. It also brings up some
 questions about
  where the table walkers should be hooked up in x86 and ARM.
 Currently
  it's after the L1, if any, but before the L2, if any, which
 seems wrong
  to me. Also caches don't seem to propagate requests upwards to
 the CPUs
  which may or may not be an issue. I'm still looking into that.
 
  Gabe

Re: [m5-dev] X86 FS regression

2010-11-23 Thread Gabe Black
I think I may have just now. I've fixed a few issues, and am now getting
to the point where something that should be in the pagetables is causing
a page fault. I found where the table walker is walking the tables for
this particular access, and the last level entry is all 0s. There could
be a number of reasons this is all 0s, but since the main difference
other than timing between this and a working configuration is the
presence of caches and we've identified a potential issue there, I'm
inclined to suspect the actual page table entry is still in the L1 and
hasn't been evicted out to memory yet.

To fix this, is the best solution to add a bus below the CPU for all the
connections that need to go to the L1? I'm assuming they'd all go into
the dcache since they're more data-ey and that keeps the icache read
only (ignoring SMC issues), and the dcache is probably servicing lower
bandwidth normally. It also seems a little strange that this type of
configuration is going on in the BaseCPU.py SimObject python file and
not a configuration file, but I could be convinced there's a reason.
Even if this isn't really a fix or the right thing to do, I'd still
like to try it temporarily at least to see if it corrects the problem
I'm seeing.

Gabe
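As a rough illustration of that bus-below-the-CPU idea (a sketch, not a tested fix), in the old port-binding syntax; the bus and attribute names are placeholders:

    from m5.objects import Bus

    def share_dcache_with_walkers(cpu, dcache):
        # Hedged sketch: a small bus below the CPU so the table walkers
        # share the L1 dcache instead of bypassing it.
        cpu.dcache_bus = Bus()
        cpu.dcache_port = cpu.dcache_bus.port      # CPU data port onto the bus
        cpu.itb.walker.port = cpu.dcache_bus.port  # walkers join the same bus
        cpu.dtb.walker.port = cpu.dcache_bus.port
        cpu.dcache_bus.port = dcache.cpu_side      # the bus feeds the L1 dcache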

Ali Saidi wrote:

 I haven't seen any strange behavior yet. That isn't to say it's not
 going to cause an issue in the future, but we've taken many a tlb miss
 and it hasn't fallen over yet.

 Ali

 On Mon, 22 Nov 2010 13:08:13 -0800, Steve Reinhardt ste...@gmail.com
 wrote:

 Yea, I just got around to reading this thread and that was the point
 I was going to make... the L1 cache effectively serves as a
  translator between the CPU's word-size read & write requests and the
 coherent block-level requests that get snooped.  If you attach a
 CPU-like device (such as the table walker) directly to an L2, the
 CPU-like accesses that go to the L2 will get sent to the L1s but I'm
 not sure they'll be handled correctly.  Not that they fundamentally
 couldn't, this just isn't a configuration we test so it's likely that
 there are problems... for example, the L1 may try to hand ownership
 to the requester but the requester won't recognize that and things
 will break.

 Steve

 On Mon, Nov 22, 2010 at 12:00 PM, Gabe Black gbl...@eecs.umich.edu
 mailto:gbl...@eecs.umich.edu wrote:

 What happens if an entry is in the L1 but not the L2?

 Gabe

 Ali Saidi wrote:
  Between the l1 and l2 caches seems like a good place to me. The
 caches can cache page table entries, otherwise a tlb miss would
  be even more expensive than it is. The l1 isn't normally used for
 such things since it would get polluted (look why sparc has a
 load 128bits from l2, do not allocate into l1 instruction).
 
  Ali
 
  On Nov 22, 2010, at 4:27 AM, Gabe Black wrote:
 
 
 For anybody waiting for an x86 FS regression (yes, I know,
 you can
  all hardly wait, but don't let this spoil your Thanksgiving)
 I'm getting
  closer to having it working, but I've discovered some issues
 with the
  mechanisms behind the --caches flag with fs.py and x86. I'm
 surprised I
  never thought to try it before. It also brings up some
 questions about
  where the table walkers should be hooked up in x86 and ARM.
 Currently
  it's after the L1, if any, but before the L2, if any, which
 seems wrong
  to me. Also caches don't seem to propagate requests upwards to
 the CPUs
  which may or may not be an issue. I'm still looking into that.
 
  Gabe


Re: [m5-dev] X86 FS regression

2010-11-23 Thread Ali Saidi


Where are you connecting the table walker? If it's between the l1 and l2
my guess is that it will work. if it is to the memory bus, yes, memory is
just responding without the help of a cache and this could be the reason.

Ali

On Tue, 23 Nov 2010 06:29:20 -0500, Gabe Black gbl...@eecs.umich.edu wrote:

 I think I may have just now. I've fixed a few issues, and am now getting
 to the point where something that should be in the pagetables is causing
 a page fault. I found where the table walker is walking the tables for
 this particular access, and the last level entry is all 0s. There could
 be a number of reasons this is all 0s, but since the main difference
 other than timing between this and a working configuration is the
 presence of caches and we've identified a potential issue there, I'm
 inclined to suspect the actual page table entry is still in the L1 and
 hasn't been evicted out to memory yet.

 To fix this, is the best solution to add a bus below the CPU for all the
 connections that need to go to the L1? I'm assuming they'd all go into
 the dcache since they're more data-ey and that keeps the icache read
 only (ignoring SMC issues), and the dcache is probably servicing lower
 bandwidth normally. It also seems a little strange that this type of
 configuration is going on in the BaseCPU.py SimObject python file and
 not a configuration file, but I could be convinced there's a reason.
 Even if this isn't really a fix or the right thing to do, I'd still
 like to try it temporarily at least to see if it corrects the problem
 I'm seeing.

 Gabe

 Ali Saidi wrote:

  I haven't seen any strange behavior yet. That isn't to say it's not
  going to cause an issue in the future, but we've taken many a tlb miss
  and it hasn't fallen over yet.

  Ali

  On Mon, 22 Nov 2010 13:08:13 -0800, Steve Reinhardt ste...@gmail.com wrote:

   Yea, I just got around to reading this thread and that was the point
   I was going to make... the L1 cache effectively serves as a
   translator between the CPU's word-size read & write requests and the
   coherent block-level requests that get snooped. If you attach a
   CPU-like device (such as the table walker) directly to an L2, the
   CPU-like accesses that go to the L2 will get sent to the L1s but I'm
   not sure they'll be handled correctly. Not that they fundamentally
   couldn't, this just isn't a configuration we test so it's likely that
   there are problems... for example, the L1 may try to hand ownership
   to the requester but the requester won't recognize that and things
   will break.

   Steve

   On Mon, Nov 22, 2010 at 12:00 PM, Gabe Black gbl...@eecs.umich.edu wrote:

    What happens if an entry is in the L1 but not the L2?

    Gabe

    Ali Saidi wrote:
     Between the l1 and l2 caches seems like a good place to me. The
     caches can cache page table entries, otherwise a tlb miss would
     be even more expensive than it is. The l1 isn't normally used for
     such things since it would get polluted (look why sparc has a
     load 128bits from l2, do not allocate into l1 instruction).

     Ali

     On Nov 22, 2010, at 4:27 AM, Gabe Black wrote:

      For anybody waiting for an x86 FS regression (yes, I know, you can
      all hardly wait, but don't let this spoil your Thanksgiving) I'm
      getting closer to having it working, but I've discovered some issues
      with the mechanisms behind the --caches flag with fs.py and x86. I'm
      surprised I never thought to try it before. It also brings up some
      questions about where the table walkers should be hooked up in x86
      and ARM. Currently it's after the L1, if any, but before the L2, if
      any, which seems wrong to me. Also caches don't seem to propagate
      requests upwards to the CPUs which may or may not be an issue. I'm
      still looking into that.

      Gabe


Re: [m5-dev] X86 FS regression

2010-11-23 Thread Steve Reinhardt
IIRC, the filter works in conjunction with the address range autodetection
stuff, so in order for a memory request to go across the bridge, the
targeted address must lie on the other side *and* not be filtered out.  I
expect this explains why IPIs aren't going across.
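In plain Python (not simulator code), the rule just described amounts to something like this, with ranges as simple (lo, hi) pairs:

    # A request crosses the bridge only if its target is known to live on
    # the far side AND it is not excluded by the filter ranges.
    def bridge_forwards(addr, far_side_ranges, filter_ranges):
        on_far_side = any(lo <= addr <= hi for lo, hi in far_side_ranges)
        filtered = any(lo <= addr <= hi for lo, hi in filter_ranges)
        return on_far_side and not filtered

With filter_ranges_a covering the whole address space, everything is filtered and nothing qualifies, which would match the IPI behavior described above.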

Thinking about it, I'm not sure why the I/O cache doesn't let uncached
accesses through from the I/O side to the memory side, assuming the target
exists on the memory side.  CPU caches certainly let uncached accesses
through, and it's the same cache module in both cases.  Hmm, looking at
fs.py, I think this line may be as much of a culprit as the others:

test_sys.iocache = IOCache(addr_range=mem_size)

I believe the address range exclusions are necessary to avoid an infinite
loop between the iocache and the bridge in the address range autodetection
algorithm, but perhaps the ranges are set up a little too conservatively so
that uncacheable addresses have no way through.  I don't think it matters
whether you open up the range in the iocache or in the bridge to let them
through, as long as (1) you only do one and not the other and (2) it's
selective enough that it doesn't include any PCI addresses that might result
in a loop.

Steve

On Tue, Nov 23, 2010 at 12:17 AM, Gabe Black gbl...@eecs.umich.edu wrote:

 I see that the bridge and cache are in parallel like you're describing.
 The culprit seems to be this line:

 configs/example/fs.py:test_sys.bridge.filter_ranges_a=[AddrRange(0,
 Addr.max)]

 where the bridge is being told explicitly not to let anything through
 from the IO side to the memory side. That should be fairly
 straightforward to poke a hole in for the necessary ranges. The
 corresponding line for the other direction (below) brings up another
 question. What happens if the bridge doesn't disallow something to go
 across and something else wants to respond to an address? The bridge
 isn't set to ignore APIC messages implementing IPIs between CPUs, but
 those seem to be going between CPUs and not out into the IO system. Are
 we just getting lucky? This same thing would seem to apply to any other
 memory side object that isn't in the address range 0-mem_size.

 configs/example/fs.py:
 test_sys.bridge.filter_ranges_b=[AddrRange(mem_size)]

 Gabe

 Steve Reinhardt wrote:
  I believe the I/O cache is normally paired with a bridge that lets
  things flow in the other direction.  It's really just designed to
  handle accesses to cacheable space from devices on the I/O bus without
  requiring each device to have a cache.  It's possible we've never had
  a situation before where I/O devices issue accesses to uncacheable
  non-memory locations on the CPU side of the I/O cache, in which case I
  would not be terribly surprised if that didn't quite work.
 
  Steve
 
  On Mon, Nov 22, 2010 at 11:59 AM, Gabe Black gbl...@eecs.umich.edu
  mailto:gbl...@eecs.umich.edu wrote:
 
  The cache claims to support all addresses on the CPU side (or so says
  the comments), but no addresses on the memory side. Messages going
  from
  the IO interrupt controller get to the IO bus but then don't know
  where
  to go since the IO cache hides the fact that the CPU interrupt
  controller wants to receive messages on that address range. I also
  don't
  know if the cache can handle messages passing through originating
 from
  the memory side, but I didn't look into that.
 
  Gabe
 
  Ali Saidi wrote:
    Something has to maintain i/o coherency and that something looks
   a whole lot like a couple-line cache. Why is having a cache there
   any issue? They should pass right through the cache.
  
   Ali
  
  
  
   On Nov 22, 2010, at 4:42 AM, Gabe Black wrote:
  
  
   Hmm. It looks like this IO cache is only added when there are
  caches in
   the system (a fix for some coherency something? I sort of
  remember that
   discussion.) and that wouldn't propagate to the IO bus the fact
  that the
   CPU's local APIC wanted to receive interrupt messages passed
  over the
   memory system. I don't know the intricacies of why the IO cache
 was
   necessary, or what problems passing requests back up through
  the cache
   might cause, but this is a serious issue for x86 and any other
  ISA that
   wants to move to a message based interrupt scheme. I suppose the
   interrupt objects could be connected all the way out onto the
  IO bus
   itself, bypassing that cache, but I'm not sure how realistic
  that is.
  
   Gabe Black wrote:
  
  For anybody waiting for an x86 FS regression (yes, I know,
  you can
   all hardly wait, but don't let this spoil your Thanksgiving)
  I'm getting
   closer to having it working, but I've discovered some issues
  with the
   mechanisms behind the --caches flag with fs.py and x86. I'm
  surprised I
   never thought to 

Re: [m5-dev] X86 FS regression

2010-11-23 Thread Steve Reinhardt
The point is that connecting between the L1 and L2 induces the same problems
wrt the L1 that connecting directly to memory induces wrt the whole cache
hierarchy.  You're just statistically more likely to get away with it in the
former case because the L1 is smaller.

Steve

On Tue, Nov 23, 2010 at 7:16 AM, Ali Saidi sa...@umich.edu wrote:


 Where are you connecting the table walker? If it's between the l1 and l2 my
 guess is that it will work. if it is to the memory bus, yes, memory is just
 responding without the help of a cache and this could be the reason.

 Ali



 On Tue, 23 Nov 2010 06:29:20 -0500, Gabe Black gbl...@eecs.umich.edu
 wrote:

 I think I may have just now. I've fixed a few issues, and am now getting
 to the point where something that should be in the pagetables is causing
 a page fault. I found where the table walker is walking the tables for
 this particular access, and the last level entry is all 0s. There could
 be a number of reasons this is all 0s, but since the main difference
 other than timing between this and a working configuration is the
 presence of caches and we've identified a potential issue there, I'm
 inclined to suspect the actual page table entry is still in the L1 and
 hasn't been evicted out to memory yet.

 To fix this, is the best solution to add a bus below the CPU for all the
 connections that need to go to the L1? I'm assuming they'd all go into
 the dcache since they're more data-ey and that keeps the icache read
 only (ignoring SMC issues), and the dcache is probably servicing lower
 bandwidth normally. It also seems a little strange that this type of
 configuration is going on in the BaseCPU.py SimObject python file and
 not a configuration file, but I could be convinced there's a reason.
 Even if this isn't really a fix or the right thing to do, I'd still
 like to try it temporarily at least to see if it corrects the problem
 I'm seeing.

 Gabe

 Ali Saidi wrote:


 I haven't seen any strange behavior yet. That isn't to say it's not
 going to cause an issue in the future, but we've taken many a tlb miss
 and it hasn't fallen over yet.

 Ali

 On Mon, 22 Nov 2010 13:08:13 -0800, Steve Reinhardt ste...@gmail.com
 wrote:

  Yea, I just got around to reading this thread and that was the point
 I was going to make... the L1 cache effectively serves as a
  translator between the CPU's word-size read & write requests and the
 coherent block-level requests that get snooped.  If you attach a
 CPU-like device (such as the table walker) directly to an L2, the
 CPU-like accesses that go to the L2 will get sent to the L1s but I'm
 not sure they'll be handled correctly.  Not that they fundamentally
 couldn't, this just isn't a configuration we test so it's likely that
 there are problems... for example, the L1 may try to hand ownership
 to the requester but the requester won't recognize that and things
 will break.

 Steve

 On Mon, Nov 22, 2010 at 12:00 PM, Gabe Black gbl...@eecs.umich.edu
 mailto:gbl...@eecs.umich.edu wrote:

What happens if an entry is in the L1 but not the L2?

Gabe

Ali Saidi wrote:
 Between the l1 and l2 caches seems like a good place to me. The
caches can cache page table entries, otherwise a tlb miss would
 be even more expensive than it is. The l1 isn't normally used for
such things since it would get polluted (look why sparc has a
load 128bits from l2, do not allocate into l1 instruction).

 Ali

 On Nov 22, 2010, at 4:27 AM, Gabe Black wrote:


For anybody waiting for an x86 FS regression (yes, I know,
you can
 all hardly wait, but don't let this spoil your Thanksgiving)
I'm getting
 closer to having it working, but I've discovered some issues
with the
 mechanisms behind the --caches flag with fs.py and x86. I'm
surprised I
 never thought to try it before. It also brings up some
questions about
 where the table walkers should be hooked up in x86 and ARM.
Currently
 it's after the L1, if any, but before the L2, if any, which
seems wrong
 to me. Also caches don't seem to propagate requests upwards to
the CPUs
 which may or may not be an issue. I'm still looking into that.

 Gabe


 

Re: [m5-dev] X86 FS regression

2010-11-23 Thread Steve Reinhardt
I definitely agree that putting a bus between the CPU and L1 and plugging
the table walker in there is the best way to figure out if this is really
the problem (and I expect it is).

I'm not sure if it's the long-term right answer or not.  We also need to
consider how this works with Ruby.

Steve

On Tue, Nov 23, 2010 at 3:29 AM, Gabe Black gbl...@eecs.umich.edu wrote:

 I think I may have just now. I've fixed a few issues, and am now getting
 to the point where something that should be in the pagetables is causing
 a page fault. I found where the table walker is walking the tables for
 this particular access, and the last level entry is all 0s. There could
 be a number of reasons this is all 0s, but since the main difference
 other than timing between this and a working configuration is the
 presence of caches and we've identified a potential issue there, I'm
 inclined to suspect the actual page table entry is still in the L1 and
 hasn't been evicted out to memory yet.

 To fix this, is the best solution to add a bus below the CPU for all the
 connections that need to go to the L1? I'm assuming they'd all go into
 the dcache since they're more data-ey and that keeps the icache read
 only (ignoring SMC issues), and the dcache is probably servicing lower
 bandwidth normally. It also seems a little strange that this type of
 configuration is going on in the BaseCPU.py SimObject python file and
 not a configuration file, but I could be convinced there's a reason.
 Even if this isn't really a fix or the right thing to do, I'd still
 like to try it temporarily at least to see if it corrects the problem
 I'm seeing.

 Gabe

 Ali Saidi wrote:
 
  I haven't seen any strange behavior yet. That isn't to say it's not
  going to cause an issue in the future, but we've taken many a tlb miss
  and it hasn't fallen over yet.
 
  Ali
 
  On Mon, 22 Nov 2010 13:08:13 -0800, Steve Reinhardt ste...@gmail.com
  wrote:
 
  Yea, I just got around to reading this thread and that was the point
  I was going to make... the L1 cache effectively serves as a
   translator between the CPU's word-size read & write requests and the
  coherent block-level requests that get snooped.  If you attach a
  CPU-like device (such as the table walker) directly to an L2, the
  CPU-like accesses that go to the L2 will get sent to the L1s but I'm
  not sure they'll be handled correctly.  Not that they fundamentally
  couldn't, this just isn't a configuration we test so it's likely that
  there are problems... for example, the L1 may try to hand ownership
  to the requester but the requester won't recognize that and things
  will break.
 
  Steve
 
  On Mon, Nov 22, 2010 at 12:00 PM, Gabe Black gbl...@eecs.umich.edu
  mailto:gbl...@eecs.umich.edu wrote:
 
  What happens if an entry is in the L1 but not the L2?
 
  Gabe
 
  Ali Saidi wrote:
   Between the l1 and l2 caches seems like a good place to me. The
  caches can cache page table entries, otherwise a tlb miss would
   be even more expensive than it is. The l1 isn't normally used for
  such things since it would get polluted (look why sparc has a
  load 128bits from l2, do not allocate into l1 instruction).
  
   Ali
  
   On Nov 22, 2010, at 4:27 AM, Gabe Black wrote:
  
  
  For anybody waiting for an x86 FS regression (yes, I know,
  you can
   all hardly wait, but don't let this spoil your Thanksgiving)
  I'm getting
   closer to having it working, but I've discovered some issues
  with the
   mechanisms behind the --caches flag with fs.py and x86. I'm
  surprised I
   never thought to try it before. It also brings up some
  questions about
   where the table walkers should be hooked up in x86 and ARM.
  Currently
   it's after the L1, if any, but before the L2, if any, which
  seems wrong
   to me. Also caches don't seem to propagate requests upwards to
  the CPUs
   which may or may not be an issue. I'm still looking into that.
  
   Gabe


Re: [m5-dev] X86 FS regression

2010-11-23 Thread Ali Saidi


Does it? Shouldn't the l2 receive the request, ask for the block and
end up snooping the l1s?

Ali

On Tue, 23 Nov 2010 07:30:00 -0800, Steve Reinhardt wrote:

 The point is that connecting between the L1 and L2 induces the same
 problems wrt the L1 that connecting directly to memory induces wrt the
 whole cache hierarchy. You're just statistically more likely to get away
 with it in the former case because the L1 is smaller.

 Steve

 On Tue, Nov 23, 2010 at 7:16 AM, Ali Saidi wrote:

  Where are you connecting the table walker? If it's between the l1 and l2
  my guess is that it will work. if it is to the memory bus, yes, memory
  is just responding without the help of a cache and this could be the
  reason.

  Ali

  On Tue, 23 Nov 2010 06:29:20 -0500, Gabe Black wrote:

   I think I may have just now. I've fixed a few issues, and am now getting
   to the point where something that should be in the pagetables is causing
   a page fault. I found where the table walker is walking the tables for
   this particular access, and the last level entry is all 0s. There could
   be a number of reasons this is all 0s, but since the main difference
   other than timing between this and a working configuration is the
   presence of caches and we've identified a potential issue there, I'm
   inclined to suspect the actual page table entry is still in the L1 and
   hasn't been evicted out to memory yet.

   To fix this, is the best solution to add a bus below the CPU for all the
   connections that need to go to the L1? I'm assuming they'd all go into
   the dcache since they're more data-ey and that keeps the icache read
   only (ignoring SMC issues), and the dcache is probably servicing lower
   bandwidth normally. It also seems a little strange that this type of
   configuration is going on in the BaseCPU.py SimObject python file and
   not a configuration file, but I could be convinced there's a reason.
   Even if this isn't really a fix or the right thing to do, I'd still
   like to try it temporarily at least to see if it corrects the problem
   I'm seeing.

   Gabe

   Ali Saidi wrote:

    I haven't seen any strange behavior yet. That isn't to say it's not
    going to cause an issue in the future, but we've taken many a tlb miss
    and it hasn't fallen over yet.

    Ali

    On Mon, 22 Nov 2010 13:08:13 -0800, Steve Reinhardt wrote:

     Yea, I just got around to reading this thread and that was the point
     I was going to make... the L1 cache effectively serves as a
     translator between the CPU's word-size read & write requests and the
     coherent block-level requests that get snooped. If you attach a
     CPU-like device (such as the table walker) directly to an L2, the
     CPU-like accesses that go to the L2 will get sent to the L1s but I'm
     not sure they'll be handled correctly. Not that they fundamentally
     couldn't, this just isn't a configuration we test so it's likely that
     there are problems... for example, the L1 may try to hand ownership
     to the requester but the requester won't recognize that and things
     will break.

     Steve

     On Mon, Nov 22, 2010 at 12:00 PM, Gabe Black wrote:

      What happens if an entry is in the L1 but not the L2?

      Gabe

      Ali Saidi wrote:
       Between the l1 and l2 caches seems like a good place to me. The
       caches can cache page table entries, otherwise a tlb miss would
       be even more expensive than it is. The l1 isn't normally used for
       such things since it would get polluted (look why sparc has a
       load 128bits from l2, do not allocate into l1 instruction).

       Ali

       On Nov 22, 2010, at 4:27 AM, Gabe Black wrote:

        For anybody waiting for an x86 FS regression (yes, I know, you can
        all hardly wait, but don't let this spoil your Thanksgiving) I'm
        getting closer to having it working, but I've discovered some
        issues with the mechanisms behind the --caches flag with fs.py and
        x86. I'm surprised I never thought to try it before. It also brings
        up some questions about where the table walkers should be hooked up
        in x86 and ARM. Currently it's after the L1, if any, but before the
        L2, if any, which seems wrong to me. Also caches don't seem to
        propagate requests upwards to the CPUs which may or may not be an
        issue. I'm still looking into that.

        Gabe

Re: [m5-dev] X86 FS regression

2010-11-23 Thread Steve Reinhardt
No, when the L2 receives a request it assumes the L1s above it have already
been snooped, which is true since the request came in on the bus that the
L1s snoop.  The issue is that caches don't necessarily behave correctly when
non-cache-block requests come in through their mem-side (snoop) port and not
through their cpu-side (request) port.  I'm guessing this could be made to
work, I'd just be very surprised if it does right now, since the caches
weren't designed to deal with this case and aren't tested this way.

Steve

On Tue, Nov 23, 2010 at 7:50 AM, Ali Saidi sa...@umich.edu wrote:

 Does it? Shouldn't the l2 receive the request, ask for the block and end up
 snooping the l1s?



 Ali





 On Tue, 23 Nov 2010 07:30:00 -0800, Steve Reinhardt ste...@gmail.com
 wrote:

 The point is that connecting between the L1 and L2 induces the same
 problems wrt the L1 that connecting directly to memory induces wrt the whole
 cache hierarchy.  You're just statistically more likely to get away with it
 in the former case because the L1 is smaller.

 Steve

 On Tue, Nov 23, 2010 at 7:16 AM, Ali Saidi sa...@umich.edu wrote:


 Where are you connecting the table walker? If it's between the l1 and l2
 my guess is that it will work. if it is to the memory bus, yes, memory is
 just responding without the help of a cache and this could be the reason.

 Ali



 On Tue, 23 Nov 2010 06:29:20 -0500, Gabe Black gbl...@eecs.umich.edu
 wrote:

 I think I may have just now. I've fixed a few issues, and am now getting
 to the point where something that should be in the pagetables is causing
 a page fault. I found where the table walker is walking the tables for
 this particular access, and the last level entry is all 0s. There could
 be a number of reasons this is all 0s, but since the main difference
 other than timing between this and a working configuration is the
 presence of caches and we've identified a potential issue there, I'm
 inclined to suspect the actual page table entry is still in the L1 and
 hasn't been evicted out to memory yet.

 To fix this, is the best solution to add a bus below the CPU for all the
 connections that need to go to the L1? I'm assuming they'd all go into
 the dcache since they're more data-ey and that keeps the icache read
 only (ignoring SMC issues), and the dcache is probably servicing lower
 bandwidth normally. It also seems a little strange that this type of
 configuration is going on in the BaseCPU.py SimObject python file and
 not a configuration file, but I could be convinced there's a reason.
 Even if this isn't really a fix or the right thing to do, I'd still
 like to try it temporarily at least to see if it corrects the problem
 I'm seeing.

 Gabe

 Ali Saidi wrote:


 I haven't seen any strange behavior yet. That isn't to say it's not
 going to cause an issue in the future, but we've taken many a tlb miss
 and it hasn't fallen over yet.

 Ali

 On Mon, 22 Nov 2010 13:08:13 -0800, Steve Reinhardt ste...@gmail.com
 wrote:

 Yea, I just got around to reading this thread and that was the point
 I was going to make... the L1 cache effectively serves as a
  translator between the CPU's word-size read & write requests and the
 coherent block-level requests that get snooped.  If you attach a
 CPU-like device (such as the table walker) directly to an L2, the
 CPU-like accesses that go to the L2 will get sent to the L1s but I'm
 not sure they'll be handled correctly.  Not that they fundamentally
 couldn't, this just isn't a configuration we test so it's likely that
 there are problems... for example, the L1 may try to hand ownership
 to the requester but the requester won't recognize that and things
 will break.

 Steve

 On Mon, Nov 22, 2010 at 12:00 PM, Gabe Black gbl...@eecs.umich.edu
 gbl...@eecs.umich.edu wrote:

What happens if an entry is in the L1 but not the L2?

Gabe

Ali Saidi wrote:
 Between the l1 and l2 caches seems like a good place to me. The
caches can cache page table entries, otherwise a tlb miss would
 be even more expensive than it is. The l1 isn't normally used for
such things since it would get polluted (look why sparc has a
load 128bits from l2, do not allocate into l1 instruction).

 Ali

 On Nov 22, 2010, at 4:27 AM, Gabe Black wrote:


For anybody waiting for an x86 FS regression (yes, I know,
you can
 all hardly wait, but don't let this spoil your Thanksgiving)
I'm getting
 closer to having it working, but I've discovered some issues
with the
 mechanisms behind the --caches flag with fs.py and x86. I'm
surprised I
 never thought to try it before. It also brings up some
questions about
 where the table walkers should be hooked up in x86 and ARM.
Currently
 it's after the L1, if any, but before the L2, if any, which
seems wrong
 to me. Also caches don't seem to propagate requests upwards to
the CPUs
 which may or may not be an issue. I'm still 

Re: [m5-dev] X86 FS regression

2010-11-23 Thread Steve Reinhardt
And even though I do think it could be made to work, I'm not sure it would
be easy or a good idea.  There are a lot of corner cases to worry about,
especially for writes, since you'd have to actually buffer the write data
somewhere as opposed to just remembering that so-and-so has requested an
exclusive copy.

Actually as I think about it, that might be the case that's breaking now...
if the L1 has an exclusive copy and then it snoops a write (and not a
read-exclusive), I'm guessing it will just invalidate its copy, losing the
modifications.  I wouldn't be terribly surprised if reads are working OK
(the L1 should snoop those and respond if it's the owner), and of course
it's all OK if the L1 doesn't have a copy of the block.

So maybe there is a relatively easy way to make this work, but figuring out
whether that's true and then testing it is still a non-trivial amount of
effort.

Steve

On Tue, Nov 23, 2010 at 7:57 AM, Steve Reinhardt ste...@gmail.com wrote:

 No, when the L2 receives a request it assumes the L1s above it have already
 been snooped, which is true since the request came in on the bus that the
 L1s snoop.  The issue is that caches don't necessarily behave correctly when
 non-cache-block requests come in through their mem-side (snoop) port and not
 through their cpu-side (request) port.  I'm guessing this could be made to
 work, I'd just be very surprised if it does right now, since the caches
 weren't designed to deal with this case and aren't tested this way.

 Steve


 On Tue, Nov 23, 2010 at 7:50 AM, Ali Saidi sa...@umich.edu wrote:

 Does it? Shouldn't the l2 receive the request, ask for the block and end
 up snooping the l1s?



 Ali





 On Tue, 23 Nov 2010 07:30:00 -0800, Steve Reinhardt ste...@gmail.com
 wrote:

 The point is that connecting between the L1 and L2 induces the same
 problems wrt the L1 that connecting directly to memory induces wrt the whole
 cache hierarchy.  You're just statistically more likely to get away with it
 in the former case because the L1 is smaller.

 Steve

 On Tue, Nov 23, 2010 at 7:16 AM, Ali Saidi sa...@umich.edu wrote:


 Where are you connecting the table walker? If it's between the l1 and l2
 my guess is that it will work. if it is to the memory bus, yes, memory is
 just responding without the help of a cache and this could be the reason.

 Ali



 On Tue, 23 Nov 2010 06:29:20 -0500, Gabe Black gbl...@eecs.umich.edu
 wrote:

 I think I may have just now. I've fixed a few issues, and am now getting
 to the point where something that should be in the pagetables is causing
 a page fault. I found where the table walker is walking the tables for
 this particular access, and the last level entry is all 0s. There could
 be a number of reasons this is all 0s, but since the main difference
 other than timing between this and a working configuration is the
 presence of caches and we've identified a potential issue there, I'm
 inclined to suspect the actual page table entry is still in the L1 and
 hasn't been evicted out to memory yet.

 To fix this, is the best solution to add a bus below the CPU for all the
 connections that need to go to the L1? I'm assuming they'd all go into
 the dcache since they're more data-ey and that keeps the icache read
 only (ignoring SMC issues), and the dcache is probably servicing lower
 bandwidth normally. It also seems a little strange that this type of
 configuration is going on in the BaseCPU.py SimObject python file and
 not a configuration file, but I could be convinced there's a reason.
 Even if this isn't really a fix or the right thing to do, I'd still
 like to try it temporarily at least to see if it corrects the problem
 I'm seeing.

 Gabe

 Ali Saidi wrote:


 I haven't seen any strange behavior yet. That isn't to say it's not
 going to cause an issue in the future, but we've taken many a tlb miss
 and it hasn't fallen over yet.

 Ali

 On Mon, 22 Nov 2010 13:08:13 -0800, Steve Reinhardt ste...@gmail.com
 wrote:

 Yea, I just got around to reading this thread and that was the point
 I was going to make... the L1 cache effectively serves as a
  translator between the CPU's word-size read & write requests and the
 coherent block-level requests that get snooped.  If you attach a
 CPU-like device (such as the table walker) directly to an L2, the
 CPU-like accesses that go to the L2 will get sent to the L1s but I'm
 not sure they'll be handled correctly.  Not that they fundamentally
 couldn't, this just isn't a configuration we test so it's likely that
 there are problems... for example, the L1 may try to hand ownership
 to the requester but the requester won't recognize that and things
 will break.

 Steve

 On Mon, Nov 22, 2010 at 12:00 PM, Gabe Black gbl...@eecs.umich.edu
 gbl...@eecs.umich.edu wrote:

What happens if an entry is in the L1 but not the L2?

Gabe

Ali Saidi wrote:
 Between the l1 and l2 caches seems like a good place to me. The
caches can cache page table entries, otherwise a tlb 

Re: [m5-dev] X86 FS regression

2010-11-23 Thread Ali Saidi


So what is the relatively good way to make this work in the short
term? A bus? What about the slightly better version? I suppose a small
cache might be ok and probably somewhat realistic. 

Thanks, 

Ali 

On Tue, 23 Nov 2010 08:15:01 -0800, Steve Reinhardt wrote:

And even though I do think it could be made to work, I'm not sure it would
be easy or a good idea. There are a lot of corner cases to worry about,
especially for writes, since you'd have to actually buffer the write data
somewhere as opposed to just remembering that so-and-so has requested an
exclusive copy.

Actually as I think about it, that might be the case that's breaking now...
if the L1 has an exclusive copy and then it snoops a write (and not a
read-exclusive), I'm guessing it will just invalidate its copy, losing the
modifications. I wouldn't be terribly surprised if reads are working OK
(the L1 should snoop those and respond if it's the owner), and of course
it's all OK if the L1 doesn't have a copy of the block.

So maybe there is a relatively easy way to make this work, but figuring out
whether that's true and then testing it is still a non-trivial amount of
effort.

Steve

On Tue, Nov 23, 2010 at 7:57 AM, Steve Reinhardt wrote:

No, when the L2 receives a request it assumes the L1s above it have already
been snooped, which is true since the request came in on the bus that the
L1s snoop. The issue is that caches don't necessarily behave correctly when
non-cache-block requests come in through their mem-side (snoop) port and
not through their cpu-side (request) port. I'm guessing this could be made
to work, I'd just be very surprised if it does right now, since the caches
weren't designed to deal with this case and aren't tested this way.

Steve

On Tue, Nov 23, 2010 at 7:50 AM, Ali Saidi wrote:

Does it? Shouldn't the l2 receive the request, ask for the block and end up
snooping the l1s?

Ali

On Tue, 23 Nov 2010 07:30:00 -0800, Steve Reinhardt wrote:

The point is that connecting between the L1 and L2 induces the same
problems wrt the L1 that connecting directly to memory induces wrt the
whole cache hierarchy. You're just statistically more likely to get away
with it in the former case because the L1 is smaller.

Steve

On Tue, Nov 23, 2010 at 7:16 AM, Ali Saidi wrote:

Where are you connecting the table walker? If it's between the l1 and l2 my
guess is that it will work. if it is to the memory bus, yes, memory is just
responding without the help of a cache and this could be the reason.

Ali

On Tue, 23 Nov 2010 06:29:20 -0500, Gabe Black wrote:

I think I may have just now. I've fixed a few issues, and am now getting to
the point where something that should be in the pagetables is causing a
page fault. I found where the table walker is walking the tables for this
particular access, and the last level entry is all 0s. There could be a
number of reasons this is all 0s, but since the main difference other than
timing between this and a working configuration is the presence of caches
and we've identified a potential issue there, I'm inclined to suspect the
actual page table entry is still in the L1 and hasn't been evicted out to
memory yet.

To fix this, is the best solution to add a bus below the CPU for all the
connections that need to go to the L1? I'm assuming they'd all go into the
dcache since they're more data-ey and that keeps the icache read only
(ignoring SMC issues), and the dcache is probably servicing lower bandwidth
normally. It also seems a little strange that this type of configuration is
going on in the BaseCPU.py SimObject python file and not a configuration
file, but I could be convinced there's a reason. Even if this isn't really
a fix or the right thing to do, I'd still like to try it temporarily at
least to see if it corrects the problem I'm seeing.

Gabe

Ali Saidi wrote:

I haven't seen any strange behavior yet. That isn't to say it's not going
to cause an issue in the future, but we've taken many a tlb miss and it
hasn't fallen over yet.

Ali

On Mon, 22 Nov 2010 13:08:13 -0800, Steve Reinhardt wrote:

Yea, I just got around to reading this thread and that was the point I was
going to make... the L1 cache effectively serves as a translator between
the CPU's word-size read & write requests and the coherent block-level
requests that get snooped. If you attach a CPU-like device (such as the
table walker) directly to an L2, the CPU-like accesses that go to the L2
will get sent to the L1s but I'm not sure they'll be handled correctly. Not
that they fundamentally couldn't, this just isn't a configuration we test
so it's likely that there are problems... for example, the L1 may try to
hand ownership to the requester but the requester won't recognize that and
things will break.

Steve

On Mon, Nov 22, 2010 at 12:00 PM, Gabe Black wrote:

What happens if an entry is in the L1 but not the L2?

Gabe

Ali Saidi wrote:

Between the l1 and l2 caches 

Re: [m5-dev] X86 FS regression

2010-11-23 Thread Steve Reinhardt
I think the two easy (python-only) solutions are sharing the existing L1 via
a bus and tacking on a small L1 to the walker.  Which one is more realistic
would depend on what you're trying to model.

Steve

On Tue, Nov 23, 2010 at 8:23 AM, Ali Saidi sa...@umich.edu wrote:

 So what is the relatively good way to make this work in the short term? A
 bus? What about the slightly better version? I suppose a small cache might
 be ok and probably somewhat realistic.



 Thanks,

 Ali





 On Tue, 23 Nov 2010 08:15:01 -0800, Steve Reinhardt ste...@gmail.com
 wrote:

 And even though I do think it could be made to work, I'm not sure it would
 be easy or a good idea.  There are a lot of corner cases to worry about,
 especially for writes, since you'd have to actually buffer the write data
 somewhere as opposed to just remembering that so-and-so has requested an
 exclusive copy.

 Actually as I think about it, that might be the case that's breaking now...
 if the L1 has an exclusive copy and then it snoops a write (and not a
 read-exclusive), I'm guessing it will just invalidate its copy, losing the
 modifications.  I wouldn't be terribly surprised if reads are working OK
 (the L1 should snoop those and respond if it's the owner), and of course
 it's all OK if the L1 doesn't have a copy of the block.

 So maybe there is a relatively easy way to make this work, but figuring out
 whether that's true and then testing it is still a non-trivial amount of
 effort.

 Steve
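
To make the write-snoop hazard concrete, here is a tiny, self-contained
Python toy (not M5 code) of the failure mode described above: an L1 holding
a block in Modified state sees a snooped plain write and just invalidates,
dropping its own dirty data.

    # Toy illustration only -- this is not how M5's cache model is written.
    class ToyL1:
        def __init__(self):
            self.state = 'I'   # MSI-style state: 'M', 'S', or 'I'
            self.data = None

        def cpu_write(self, value):
            self.state = 'M'   # a CPU store makes the line dirty/owned
            self.data = value

        def snoop_write(self, value, naive=True):
            if self.state == 'M' and naive:
                # Suspected current behavior: invalidate without writing back
                # or merging, so the CPU's earlier store is silently lost.
                self.state, self.data = 'I', None
            elif self.state == 'M':
                # Correct handling would buffer/merge the snooped write data
                # with the dirty line (or write back first) -- the extra
                # machinery the caches were never built to provide here.
                self.data = value
            else:
                self.state = 'I'

    l1 = ToyL1()
    l1.cpu_write('CPU store A')       # line is now Modified in the L1
    l1.snoop_write('walker store B')  # naive snoop handling...
    print(l1.state, l1.data)          # -> I None: the CPU's store vanished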

 On Tue, Nov 23, 2010 at 7:57 AM, Steve Reinhardt ste...@gmail.com wrote:

 No, when the L2 receives a request it assumes the L1s above it have
 already been snooped, which is true since the request came in on the bus
 that the L1s snoop.  The issue is that caches don't necessarily behave
 correctly when non-cache-block requests come in through their mem-side
 (snoop) port and not through their cpu-side (request) port.  I'm guessing
 this could be made to work, I'd just be very surprised if it does right now,
 since the caches weren't designed to deal with this case and aren't tested
 this way.

 Steve


 On Tue, Nov 23, 2010 at 7:50 AM, Ali Saidi sa...@umich.edu wrote:

 Does it? Shouldn't the l2 receive the request, ask for the block and end
 up snooping the l1s?



 Ali





 On Tue, 23 Nov 2010 07:30:00 -0800, Steve Reinhardt ste...@gmail.com
 wrote:

  The point is that connecting between the L1 and L2 induces the same
 problems wrt the L1 that connecting directly to memory induces wrt the whole
 cache hierarchy.  You're just statistically more likely to get away with it
 in the former case because the L1 is smaller.

 Steve

   On Tue, Nov 23, 2010 at 7:16 AM, Ali Saidi sa...@umich.edu wrote:


 Where are you connecting the table walker? If it's between the l1 and l2
 my guess is that it will work. if it is to the memory bus, yes, memory is
 just responding without the help of a cache and this could be the reason.

 Ali



 On Tue, 23 Nov 2010 06:29:20 -0500, Gabe Black gbl...@eecs.umich.edu
 wrote:

  I think I may have just now. I've fixed a few issues, and am now getting
 to the point where something that should be in the pagetables is causing
 a page fault. I found where the table walker is walking the tables for
 this particular access, and the last level entry is all 0s. There could
 be a number of reasons this is all 0s, but since the main difference
 other than timing between this and a working configuration is the
 presence of caches and we've identified a potential issue there, I'm
 inclined to suspect the actual page table entry is still in the L1 and
 hasn't been evicted out to memory yet.

 To fix this, is the best solution to add a bus below the CPU for all the
 connections that need to go to the L1? I'm assuming they'd all go into
 the dcache since they're more data-ey and that keeps the icache read
 only (ignoring SMC issues), and the dcache is probably servicing lower
 bandwidth normally. It also seems a little strange that this type of
 configuration is going on in the BaseCPU.py SimObject python file and
 not a configuration file, but I could be convinced there's a reason.
 Even if this isn't really a fix or the right thing to do, I'd still
 like to try it temporarily at least to see if it corrects the problem
 I'm seeing.

 Gabe

 Ali Saidi wrote:


 I haven't seen any strange behavior yet. That isn't to say it's not
 going to cause an issue in the future, but we've taken many a tlb miss
 and it hasn't fallen over yet.

 Ali

 On Mon, 22 Nov 2010 13:08:13 -0800, Steve Reinhardt ste...@gmail.com
 
 wrote:

   Yea, I just got around to reading this thread and that was the point
 I was going to make... the L1 cache effectively serves as a translator
 between the CPU's word-size read & write requests and the coherent
 block-level requests that get snooped.  If you attach a CPU-like device
 (such as the table walker) directly to an L2, the CPU-like accesses that
 go to the L2 will get sent to the L1s but I'm not sure they'll be 

Re: [m5-dev] X86 FS regression

2010-11-23 Thread Gabe Black
Of these, I think the walker cache sounds better for two reasons. First,
it avoids the L1 pollution Ali was talking about, and second, a new bus
would add mostly inert stuff on the way to memory and would
involve looking up what port to use even though it'd always be the same
one. I'll give that a try.

Gabe
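
A rough sketch of what "tacking on a small L1 to the walker" might look like
in Python, to make the chosen option concrete. The sizes, parameters, and
walker port names below are illustrative guesses, not the change that was
actually committed.

    # Sketch only: give each table walker its own tiny private cache rather
    # than sharing the CPU's L1 dcache.  All names and sizes are illustrative.
    from m5.objects import *

    class WalkerCache(BaseCache):
        size = '1kB'           # a handful of lines is plenty for walks
        assoc = 2
        block_size = 64
        latency = '1ns'
        mshrs = 4
        tgts_per_mshr = 4

    itb_walker_cache = WalkerCache()
    dtb_walker_cache = WalkerCache()

    # Assumed walker port names; the real ones come from the ISA's TLB objects.
    cpu.itb.walker.port = itb_walker_cache.cpu_side
    cpu.dtb.walker.port = dtb_walker_cache.cpu_side

    # The walker caches sit on the same bus as the L1s, so their blocks get
    # snooped like any other L1's and nothing talks to the L2 "bare".
    itb_walker_cache.mem_side = tol2bus.port
    dtb_walker_cache.mem_side = tol2bus.port

Compared with sharing the dcache through a bus, this keeps page-table lines
out of the L1 and avoids the extra port lookup mentioned above.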

Steve Reinhardt wrote:
 I think the two easy (python-only) solutions are sharing the existing
 L1 via a bus and tacking on a small L1 to the walker.  Which one is
 more realistic would depend on what you're trying to model.

 Steve

 On Tue, Nov 23, 2010 at 8:23 AM, Ali Saidi sa...@umich.edu
 mailto:sa...@umich.edu wrote:

 So what is the relatively good way to make this work in the short
 term? A bus? What about the slightly better version? I suppose a
 small cache might be ok and probably somewhat realistic.

  

 Thanks,

 Ali

  

  

 On Tue, 23 Nov 2010 08:15:01 -0800, Steve Reinhardt
 ste...@gmail.com mailto:ste...@gmail.com wrote:

 And even though I do think it could be made to work, I'm not sure
 it would be easy or a good idea.  There are a lot of corner cases
 to worry about, especially for writes, since you'd have to
 actually buffer the write data somewhere as opposed to just
 remembering that so-and-so has requested an exclusive copy.

 Actually as I think about it, that might be the case that's
 breaking now... if the L1 has an exclusive copy and then it
 snoops a write (and not a read-exclusive), I'm guessing it will
 just invalidate its copy, losing the modifications.  I wouldn't
 be terribly surprised if reads are working OK (the L1 should
 snoop those and respond if it's the owner), and of course it's
 all OK if the L1 doesn't have a copy of the block.

 So maybe there is a relatively easy way to make this work, but
 figuring out whether that's true and then testing it is still a
 non-trivial amount of effort.

 Steve

 On Tue, Nov 23, 2010 at 7:57 AM, Steve Reinhardt
 ste...@gmail.com mailto:ste...@gmail.com wrote:

 No, when the L2 receives a request it assumes the L1s above
 it have already been snooped, which is true since the request
 came in on the bus that the L1s snoop.  The issue is that
 caches don't necessarily behave correctly when
 non-cache-block requests come in through their mem-side
 (snoop) port and not through their cpu-side (request) port. 
 I'm guessing this could be made to work, I'd just be very
 surprised if it does right now, since the caches weren't
 designed to deal with this case and aren't tested this way.

 Steve


 On Tue, Nov 23, 2010 at 7:50 AM, Ali Saidi sa...@umich.edu
 mailto:sa...@umich.edu wrote:

 Does it? Shouldn't the l2 receive the request, ask for
 the block and end up snooping the l1s?

  

 Ali

  

  

 On Tue, 23 Nov 2010 07:30:00 -0800, Steve Reinhardt
 ste...@gmail.com mailto:ste...@gmail.com wrote:

 The point is that connecting between the L1 and L2
 induces the same problems wrt the L1 that connecting
 directly to memory induces wrt the whole cache
 hierarchy.  You're just statistically more likely to
 get away with it in the former case because the L1 is
 smaller.

 Steve

 On Tue, Nov 23, 2010 at 7:16 AM, Ali Saidi
 sa...@umich.edu mailto:sa...@umich.edu wrote:


 Where are you connecting the table walker? If
 it's between the l1 and l2 my guess is that it
 will work. if it is to the memory bus, yes,
 memory is just responding without the help of a
 cache and this could be the reason.

 Ali



 On Tue, 23 Nov 2010 06:29:20 -0500, Gabe Black
 gbl...@eecs.umich.edu
 mailto:gbl...@eecs.umich.edu wrote:

 I think I may have just now. I've fixed a few
 issues, and am now getting
 to the point where something that should be
 in the pagetables is causing
 a page fault. I found where the table walker
 is walking the tables for
 this particular access, and the last level
 entry is all 0s. There could
 be a number of reasons this is all 0s, but
 since the main difference
 other than timing between this and a working
 configuration is the
 presence of caches and we've identified a
 

[m5-dev] X86 FS regression

2010-11-22 Thread Gabe Black
For anybody waiting for an x86 FS regression (yes, I know, you can
all hardly wait, but don't let this spoil your Thanksgiving) I'm getting
closer to having it working, but I've discovered some issues with the
mechanisms behind the --caches flag with fs.py and x86. I'm surprised I
never thought to try it before. It also brings up some questions about
where the table walkers should be hooked up in x86 and ARM. Currently
it's after the L1, if any, but before the L2, if any, which seems wrong
to me. Also caches don't seem to propagate requests upwards to the CPUs
which may or may not be an issue. I'm still looking into that.

Gabe
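
For readers who have not looked at fs.py, this is roughly the topology being
described, with the walker hookup that seems wrong called out. Class and port
names are illustrative, not the exact fs.py/BaseCPU.py code.

    # Rough sketch of the --caches topology: private L1s, a shared bus, an
    # L2, and (currently) the table walkers attached after the L1s but before
    # the L2.  All names are illustrative.
    from m5.objects import *

    icache = L1Cache(size='32kB', assoc=2)
    dcache = L1Cache(size='64kB', assoc=2)
    l2 = L2Cache(size='2MB', assoc=8)
    tol2bus = Bus()

    cpu.icache_port = icache.cpu_side
    cpu.dcache_port = dcache.cpu_side
    icache.mem_side = tol2bus.port
    dcache.mem_side = tol2bus.port

    # The questionable part: the walkers bypass the L1s and talk straight to
    # the bus in front of the L2, so their requests never look like normal
    # CPU-side traffic to the L1s.
    cpu.itb.walker.port = tol2bus.port   # assumed walker port name
    cpu.dtb.walker.port = tol2bus.port   # assumed walker port name

    l2.cpu_side = tol2bus.port
    l2.mem_side = system.membus.port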
___
m5-dev mailing list
m5-dev@m5sim.org
http://m5sim.org/mailman/listinfo/m5-dev


Re: [m5-dev] X86 FS regression

2010-11-22 Thread Gabe Black
Hmm. It looks like this IO cache is only added when there are caches in
the system (a fix for some coherency something? I sort of remember that
discussion.) and that wouldn't propagate to the IO bus the fact that the
CPU's local APIC wanted to receive interrupt messages passed over the
memory system. I don't know the intricacies of why the IO cache was
necessary, or what problems passing requests back up through the cache
might cause, but this is a serious issue for x86 and any other ISA that
wants to move to a message based interrupt scheme. I suppose the
interrupt objects could be connected all the way out onto the IO bus
itself, bypassing that cache, but I'm not sure how realistic that is.
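
A sketch of the workaround floated in the last sentence, i.e. wiring the
CPU's interrupt object straight onto the I/O bus so interrupt messages never
have to pass back up through the I/O cache. The port names on the interrupt
object are hypothetical; whatever the local APIC model actually exposes
would have to be used instead.

    # Sketch only: bypass the I/O cache for interrupt-message traffic by
    # connecting the local APIC's ports directly to the iobus.  Port names
    # on cpu.interrupts are guesses, not the real X86 local APIC interface.
    from m5.objects import *

    cpu.interrupts.pio = system.iobus.port        # assumed PIO port name
    cpu.interrupts.int_port = system.iobus.port   # assumed message port name

    # Devices on the iobus (e.g. the I/O APIC) could then see and address
    # the local APIC's range without the I/O cache hiding it.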

Gabe Black wrote:
 For anybody waiting for an x86 FS regression (yes, I know, you can
 all hardly wait, but don't let this spoil your Thanksgiving) I'm getting
 closer to having it working, but I've discovered some issues with the
 mechanisms behind the --caches flag with fs.py and x86. I'm surprised I
 never thought to try it before. It also brings up some questions about
 where the table walkers should be hooked up in x86 and ARM. Currently
 it's after the L1, if any, but before the L2, if any, which seems wrong
 to me. Also caches don't seem to propagate requests upwards to the CPUs
 which may or may not be an issue. I'm still looking into that.

 Gabe

___
m5-dev mailing list
m5-dev@m5sim.org
http://m5sim.org/mailman/listinfo/m5-dev


Re: [m5-dev] X86 FS regression

2010-11-22 Thread Ali Saidi
Between the l1 and l2 caches seems like a good place to me. The caches can 
cache page table entries, otherwise a tlb miss would be even more expensive 
than it is. The l1 isn't normally used for such things since it would get 
polluted (look at why sparc has a load 128bits from l2, do not allocate into 
l1 instruction). 

Ali

On Nov 22, 2010, at 4:27 AM, Gabe Black wrote:

For anybody waiting for an x86 FS regression (yes, I know, you can
 all hardly wait, but don't let this spoil your Thanksgiving) I'm getting
 closer to having it working, but I've discovered some issues with the
 mechanisms behind the --caches flag with fs.py and x86. I'm surprised I
 never thought to try it before. It also brings up some questions about
 where the table walkers should be hooked up in x86 and ARM. Currently
 it's after the L1, if any, but before the L2, if any, which seems wrong
 to me. Also caches don't seem to propagate requests upwards to the CPUs
 which may or may not be an issue. I'm still looking into that.
 
 Gabe

___
m5-dev mailing list
m5-dev@m5sim.org
http://m5sim.org/mailman/listinfo/m5-dev


Re: [m5-dev] X86 FS regression

2010-11-22 Thread Ali Saidi
Something has to maintain i/o coherency, and that something looks a whole lot 
like a couple-line cache. Why is having a cache there an issue? They should 
pass right through the cache.

Ali



On Nov 22, 2010, at 4:42 AM, Gabe Black wrote:

 Hmm. It looks like this IO cache is only added when there are caches in
 the system (a fix for some coherency something? I sort of remember that
 discussion.) and that wouldn't propagate to the IO bus the fact that the
 CPU's local APIC wanted to receive interrupt messages passed over the
 memory system. I don't know the intricacies of why the IO cache was
 necessary, or what problems passing requests back up through the cache
 might cause, but this is a serious issue for x86 and any other ISA that
 wants to move to a message based interrupt scheme. I suppose the
 interrupt objects could be connected all the way out onto the IO bus
 itself, bypassing that cache, but I'm not sure how realistic that is.
 
 Gabe Black wrote:
For anybody waiting for an x86 FS regression (yes, I know, you can
 all hardly wait, but don't let this spoil your Thanksgiving) I'm getting
 closer to having it working, but I've discovered some issues with the
 mechanisms behind the --caches flag with fs.py and x86. I'm surprised I
 never thought to try it before. It also brings up some questions about
 where the table walkers should be hooked up in x86 and ARM. Currently
 it's after the L1, if any, but before the L2, if any, which seems wrong
 to me. Also caches don't seem to propagate requests upwards to the CPUs
 which may or may not be an issue. I'm still looking into that.
 
 Gabe

___
m5-dev mailing list
m5-dev@m5sim.org
http://m5sim.org/mailman/listinfo/m5-dev


Re: [m5-dev] X86 FS regression

2010-11-22 Thread Gabe Black
The cache claims to support all addresses on the CPU side (or so says
the comments), but no addresses on the memory side. Messages going from
the IO interrupt controller get to the IO bus but then don't know where
to go since the IO cache hides the fact that the CPU interrupt
controller wants to receive messages on that address range. I also don't
know if the cache can handle messages passing through originating from
the memory side, but I didn't look into that.

Gabe
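
The routing problem can be illustrated with a tiny self-contained Python toy
(not M5's actual range/port machinery): the iobus routes by advertised
address ranges, and because the I/O cache advertises nothing on its memory
side, the local APIC's window simply is not claimed by anyone.

    # Toy illustration of why the interrupt message gets stuck at the iobus.
    # Ranges and port names are made up; only the routing idea matters.
    LOCAL_APIC_RANGE = (0xFEE00000, 0xFEE00FFF)   # x86 local APIC MMIO window

    iobus_ports = {
        'iocache.cpu_side': [],   # advertises no ranges toward the iobus
        'some_device.pio':  [(0xC0000000, 0xC0000FFF)],
    }

    def route(addr):
        for port, ranges in iobus_ports.items():
            if any(lo <= addr <= hi for lo, hi in ranges):
                return port
        return None   # nobody claims the address: the message has nowhere to go

    print(route(LOCAL_APIC_RANGE[0]))   # -> None; the LAPIC is hidden behind the iocache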

Ali Saidi wrote:
 Something has to maintain i/o coherency, and that something looks a whole 
 lot like a couple-line cache. Why is having a cache there an issue? They 
 should pass right through the cache.

 Ali



 On Nov 22, 2010, at 4:42 AM, Gabe Black wrote:

   
 Hmm. It looks like this IO cache is only added when there are caches in
 the system (a fix for some coherency something? I sort of remember that
 discussion.) and that wouldn't propagate to the IO bus the fact that the
 CPU's local APIC wanted to receive interrupt messages passed over the
 memory system. I don't know the intricacies of why the IO cache was
 necessary, or what problems passing requests back up through the cache
 might cause, but this is a serious issue for x86 and any other ISA that
 wants to move to a message based interrupt scheme. I suppose the
 interrupt objects could be connected all the way out onto the IO bus
 itself, bypassing that cache, but I'm not sure how realistic that is.

 Gabe Black wrote:
 
For anybody waiting for an x86 FS regression (yes, I know, you can
 all hardly wait, but don't let this spoil your Thanksgiving) I'm getting
 closer to having it working, but I've discovered some issues with the
 mechanisms behind the --caches flag with fs.py and x86. I'm surprised I
 never thought to try it before. It also brings up some questions about
 where the table walkers should be hooked up in x86 and ARM. Currently
 it's after the L1, if any, but before the L2, if any, which seems wrong
 to me. Also caches don't seem to propagate requests upwards to the CPUs
 which may or may not be an issue. I'm still looking into that.

 Gabe

___
m5-dev mailing list
m5-dev@m5sim.org
http://m5sim.org/mailman/listinfo/m5-dev


Re: [m5-dev] X86 FS regression

2010-11-22 Thread Gabe Black
What happens if an entry is in the L1 but not the L2?

Gabe

Ali Saidi wrote:
 Between the l1 and l2 caches seems like a good place to me. The caches can 
 cache page table entries, otherwise a tlb miss would be even more expensive 
 than it is. The l1 isn't normally used for such things since it would get 
 polluted (look at why sparc has a load 128bits from l2, do not allocate into 
 l1 instruction). 

 Ali

 On Nov 22, 2010, at 4:27 AM, Gabe Black wrote:

   
For anybody waiting for an x86 FS regression (yes, I know, you can
 all hardly wait, but don't let this spoil your Thanksgiving) I'm getting
 closer to having it working, but I've discovered some issues with the
 mechanisms behind the --caches flag with fs.py and x86. I'm surprised I
 never thought to try it before. It also brings up some questions about
 where the table walkers should be hooked up in x86 and ARM. Currently
 it's after the L1, if any, but before the L2, if any, which seems wrong
 to me. Also caches don't seem to propagate requests upwards to the CPUs
 which may or may not be an issue. I'm still looking into that.

 Gabe

___
m5-dev mailing list
m5-dev@m5sim.org
http://m5sim.org/mailman/listinfo/m5-dev


Re: [m5-dev] X86 FS regression

2010-11-22 Thread Steve Reinhardt
I believe the I/O cache is normally paired with a bridge that lets things
flow in the other direction.  It's really just designed to handle accesses
to cacheable space from devices on the I/O bus without requiring each device
to have a cache.  It's possible we've never had a situation before where I/O
devices issue accesses to uncacheable non-memory locations on the CPU side
of the I/O cache, in which case I would not be terribly surprised if that
didn't quite work.

Steve
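
For reference, the pairing described above looks roughly like this in the
FSConfig.py style of the time: a bridge carries CPU-side traffic out to the
I/O bus, while the I/O cache carries device traffic into cacheable memory.
Parameter and port names follow that era's configs but are not copied
verbatim, so treat them as approximate.

    # Sketch of the usual I/O plumbing: a bridge for membus -> iobus traffic,
    # an I/O cache for iobus -> cacheable-memory traffic.  Names approximate.
    from m5.objects import *

    system.iobus = Bus()
    system.membus = Bus()

    # Bridge: lets CPU-side accesses reach uncacheable device registers.
    system.bridge = Bridge(delay='50ns')
    system.bridge.side_a = system.iobus.port
    system.bridge.side_b = system.membus.port

    # I/O cache: gives devices on the iobus a coherent window into DRAM.
    system.iocache = IOCache()   # plus an addr_range covering DRAM (omitted)
    system.iocache.cpu_side = system.iobus.port
    system.iocache.mem_side = system.membus.port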

On Mon, Nov 22, 2010 at 11:59 AM, Gabe Black gbl...@eecs.umich.edu wrote:

 The cache claims to support all addresses on the CPU side (or so says
 the comments), but no addresses on the memory side. Messages going from
 the IO interrupt controller get to the IO bus but then don't know where
 to go since the IO cache hides the fact that the CPU interrupt
 controller wants to receive messages on that address range. I also don't
 know if the cache can handle messages passing through originating from
 the memory side, but I didn't look into that.

 Gabe

 Ali Saidi wrote:
  Something has to maintain i/o coherency, and that something looks a whole
 lot like a couple-line cache. Why is having a cache there an issue? They
 should pass right through the cache.
 
  Ali
 
 
 
  On Nov 22, 2010, at 4:42 AM, Gabe Black wrote:
 
 
  Hmm. It looks like this IO cache is only added when there are caches in
  the system (a fix for some coherency something? I sort of remember that
  discussion.) and that wouldn't propagate to the IO bus the fact that the
  CPU's local APIC wanted to receive interrupt messages passed over the
  memory system. I don't know the intricacies of why the IO cache was
  necessary, or what problems passing requests back up through the cache
  might cause, but this is a serious issue for x86 and any other ISA that
  wants to move to a message based interrupt scheme. I suppose the
  interrupt objects could be connected all the way out onto the IO bus
  itself, bypassing that cache, but I'm not sure how realistic that is.
 
  Gabe Black wrote:
 
 For anybody waiting for an x86 FS regression (yes, I know, you can
  all hardly wait, but don't let this spoil your Thanksgiving) I'm
 getting
  closer to having it working, but I've discovered some issues with the
  mechanisms behind the --caches flag with fs.py and x86. I'm surprised I
  never thought to try it before. It also brings up some questions about
  where the table walkers should be hooked up in x86 and ARM. Currently
  it's after the L1, if any, but before the L2, if any, which seems wrong
  to me. Also caches don't seem to propagate requests upwards to the CPUs
  which may or may not be an issue. I'm still looking into that.
 
  Gabe

___
m5-dev mailing list
m5-dev@m5sim.org
http://m5sim.org/mailman/listinfo/m5-dev


Re: [m5-dev] X86 FS regression

2010-11-22 Thread Steve Reinhardt
Yea, I just got around to reading this thread and that was the point I was
going to make... the L1 cache effectively serves as a translator between the
CPU's word-size read & write requests and the coherent block-level requests
that get snooped.  If you attach a CPU-like device (such as the table
walker) directly to an L2, the CPU-like accesses that go to the L2 will get
sent to the L1s but I'm not sure they'll be handled correctly.  Not that
they fundamentally couldn't, this just isn't a configuration we test so it's
likely that there are problems... for example, the L1 may try to hand
ownership to the requester but the requester won't recognize that and things
will break.

Steve

On Mon, Nov 22, 2010 at 12:00 PM, Gabe Black gbl...@eecs.umich.edu wrote:

 What happens if an entry is in the L1 but not the L2?

 Gabe

 Ali Saidi wrote:
  Between the l1 and l2 caches seems like a good place to me. The caches
  can cache page table entries, otherwise a tlb miss would be even more
  expensive than it is. The l1 isn't normally used for such things since it
  would get polluted (look at why sparc has a load 128bits from l2, do not
  allocate into l1 instruction).
 
  Ali
 
  On Nov 22, 2010, at 4:27 AM, Gabe Black wrote:
 
 
 For anybody waiting for an x86 FS regression (yes, I know, you can
  all hardly wait, but don't let this spoil your Thanksgiving) I'm getting
  closer to having it working, but I've discovered some issues with the
  mechanisms behind the --caches flag with fs.py and x86. I'm surprised I
  never thought to try it before. It also brings up some questions about
  where the table walkers should be hooked up in x86 and ARM. Currently
  it's after the L1, if any, but before the L2, if any, which seems wrong
  to me. Also caches don't seem to propagate requests upwards to the CPUs
  which may or may not be an issue. I'm still looking into that.
 
  Gabe

___
m5-dev mailing list
m5-dev@m5sim.org
http://m5sim.org/mailman/listinfo/m5-dev


Re: [m5-dev] X86 FS regression

2010-11-22 Thread Ali Saidi


I haven't seen any strange behavior yet. That isn't to say it's not
going to cause an issue in the future, but we've taken many a tlb miss
and it hasn't fallen over yet. 

Ali 

On Mon, 22 Nov 2010 13:08:13 -0800, Steve Reinhardt wrote:

Yea, I just got around to reading this thread and that was the point I was
going to make... the L1 cache effectively serves as a translator between
the CPU's word-size read & write requests and the coherent block-level
requests that get snooped. If you attach a CPU-like device (such as the
table walker) directly to an L2, the CPU-like accesses that go to the L2
will get sent to the L1s but I'm not sure they'll be handled correctly. Not
that they fundamentally couldn't, this just isn't a configuration we test
so it's likely that there are problems... for example, the L1 may try to
hand ownership to the requester but the requester won't recognize that and
things will break.

Steve

On Mon, Nov 22, 2010 at 12:00 PM, Gabe Black wrote:

What happens if an entry is in the L1 but not the L2?

Gabe

Ali Saidi wrote:

Between the l1 and l2 caches seems like a good place to me. The caches can
cache page table entries, otherwise a tlb miss would be even more expensive
than it is. The l1 isn't normally used for such things since it would get
polluted (look at why sparc has a load 128bits from l2, do not allocate
into l1 instruction).

Ali

On Nov 22, 2010, at 4:27 AM, Gabe Black wrote:

For anybody waiting for an x86 FS regression (yes, I know, you can all
hardly wait, but don't let this spoil your Thanksgiving) I'm getting closer
to having it working, but I've discovered some issues with the mechanisms
behind the --caches flag with fs.py and x86. I'm surprised I never thought
to try it before. It also brings up some questions about where the table
walkers should be hooked up in x86 and ARM. Currently it's after the L1, if
any, but before the L2, if any, which seems wrong to me. Also caches don't
seem to propagate requests upwards to the CPUs which may or may not be an
issue. I'm still looking into that.

Gabe
___
m5-dev mailing list
m5-dev@m5sim.org
http://m5sim.org/mailman/listinfo/m5-dev