Re: [gem5-users] In-order faster than O3?

2011-05-29 Thread Marc de Kruijf
For anyone who might be interested, this issue has been resolved.

There were two issues. First, I had assumed O3 used a write buffer to allow
stores to retire early (implying non-SC memory consistency).  I modeled one
in my in-order processor, but O3 appears to enforce SC and hence does not have
one.  Second, when I increased the MemWrite and MemRead latencies, I did not
also increase the wbDepth setting beyond its default of 1, which created a
writeback bottleneck when multiple longer-latency operations were in flight
simultaneously.
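For anyone hitting the same bottleneck, a configuration sketch of the fix (the
object and parameter names -- DerivO3CPU, issueWidth, wbDepth -- follow the
gem5/M5 tree of this era and are assumptions; check your version's O3CPU.py):

```python
# Illustrative gem5 config fragment, not a complete script.
# Assumes the DerivO3CPU parameter names from the 2011-era source tree.
from m5.objects import DerivO3CPU

cpu = DerivO3CPU()
cpu.issueWidth = 2

# With MemRead/MemWrite latencies above 1, several long-latency ops can
# complete in the same cycle; the default wbDepth of 1 then serializes
# their writebacks. Raising it removes that artificial stall:
cpu.wbDepth = 4   # default is 1
```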



[gem5-users] In-order faster than O3?

2011-05-18 Thread Marc de Kruijf
Hi all,

I recently extended the atomic CPU model to simulate a deeply-pipelined
two-issue in-order machine.  The code includes variable length instruction
latencies, checks for register dependences, has full bypass/forwarding
capability, and so on.  I have reason to believe it is working as it should.

Curiously, when I run binaries using this CPU model, it frequently
outperforms the O3 CPU model in terms of cycle count.  The O3 model I
compare against is also two-issue, with an 8-entry load queue, an 8-entry
store queue, a 16-entry IQ, a 32-entry ROB, and extra physical registers,
but is otherwise configured identically.  The in-order core models identical
branch prediction with a rather generous 13-cycle mispredict penalty for the
two-issue core (e.g. as in the ARM Cortex-A8), and still achieves better
performance in most cases.

I'm finding it hard to parse through all the O3 trace logs, so I was
wondering if anyone has intuition as to why this might be the case.  Does
the O3 CPU not do full bypassing?  Is there speculation going on beyond just
branch prediction?  I plan to look into the source code in more detail, but
I was wondering if someone could give me a leg up by pointing me in the
right direction.

I've also noticed when I set the MemRead and MemWrite latencies in
src/cpu/o3/FuncUnitConfig.py to anything greater than 1, O3 performance
slows down quite drastically (~10% per increment).  This doesn't really make
sense to me either.  I'm not configuring with a massive instruction window,
but I wouldn't expect performance to suffer quite so much.  If it helps, all
my simulations so far are just using ARM.
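For context, the latencies in question are set roughly like this (a sketch in
the spirit of src/cpu/o3/FuncUnitConfig.py; the class and field names are
assumptions based on the gem5 source of this era, not the exact file
contents):

```python
# Illustrative gem5 config fragment, not a complete script.
# FUDesc describes a functional unit; OpDesc gives per-opclass latency.
from m5.objects import FUDesc, OpDesc

class ReadPort(FUDesc):
    # opLat is the MemRead latency being raised from its default of 1
    opList = [ OpDesc(opClass='MemRead', opLat=2) ]
    count = 1

class WritePort(FUDesc):
    opList = [ OpDesc(opClass='MemWrite', opLat=2) ]
    count = 1
```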
___
gem5-users mailing list
gem5-users@m5sim.org
http://m5sim.org/cgi-bin/mailman/listinfo/gem5-users

Re: [gem5-users] In-order faster than O3?

2011-05-18 Thread Ali Saidi
Hi Marc,

If you haven't updated your code recently, I committed some changes last week 
that fixed some dependency issues with the ARM condition codes in the o3 cpu 
model. Previously, any instruction that wrote a condition code had to do a 
read-modify-write operation on all the condition codes together, meaning that 
a string of instructions that set condition codes were all dependent on each 
other. The committed code fixes this issue and sees improvements of up to 22% 
on some SPEC benchmarks.

If that doesn't fix the issue, you'll need to see where the o3 model is 
stalling on your workload. Some of the statistics might help narrow it down a 
bit. The model should be able to issue dependent instructions in back-to-back 
cycles, and it executes instructions speculatively (including loads). 

Any chance you'd share your cpu model? Are you sure you're accounting for 
memory latency correctly in it? The atomic memory mode completes a load/store 
instantly, so if you're not correctly accounting for the real time it would 
take for that load/store to complete that could be part of the issue.

Ali



Re: [gem5-users] In-order faster than O3?

2011-05-18 Thread Korey Sewell
I'd also take a look at how many MSHRs you are giving your caches and see if
it matches with your CPU model. For example, if you only have 2 MSHRs but your
model is issuing up to 8 speculative loads, there's a chance your system is
under-provisioned and losing some performance.
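A sketch of what matching the provisioning might look like (BaseCache
parameter names -- mshrs, tgts_per_mshr -- are assumed from the gem5 source of
this era; verify against your tree):

```python
# Illustrative gem5 config fragment, not a complete script.
from m5.objects import BaseCache

dcache = BaseCache(size='32kB', assoc=2)

# If the core can have up to 8 speculative loads in flight, fewer than
# 8 MSHRs will serialize the misses behind each other:
dcache.mshrs = 8
dcache.tgts_per_mshr = 16   # secondary misses coalesced per MSHR
```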





-- 
- Korey

Re: [gem5-users] In-order faster than O3?

2011-05-18 Thread Marc de Kruijf
Thanks Ali and Korey.

My checkout is about a month old, so that could be the issue.  I'll take a
look tomorrow.

MSHR settings are okay.  I'm using the atomic-reported memory latencies
(dcache_latency and icache_latency) to compute access latencies in my model.
I assume these are as accurate as the timing or O3 CPU latencies for
single-threaded workloads.

If I keep having issues I'm happy to share the CPU model but it's a research
prototype configured to do exotic researchy sorts of things so I'm not sure
how helpful that would be.  =)



Re: [gem5-users] In-order faster than O3?

2011-05-18 Thread Korey Sewell
"exotic researchy sorts of things"
In my early morning delirious state, that sir is a +1.




-- 
- Korey

Re: [gem5-users] In-order faster than O3?

2011-05-18 Thread Ali Saidi
Hi Marc,

The atomic latency isn't as accurate as the latency reported with the memory 
system in timing mode. What is returned is an unloaded latency (one request in 
the entire memory system, no contention at all).
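A toy illustration of the point (plain Python, not gem5 code): with a single
in-order memory port, a request's observed latency equals the unloaded latency
only when nothing is queued ahead of it.

```python
# Toy model: one memory port serves requests in order, each taking
# SERVICE cycles. Shows why an unloaded latency underestimates the
# latency seen under contention.
SERVICE = 10  # unloaded latency of a single request, in cycles

def loaded_latencies(arrival_cycles):
    """Per-request latency when requests queue on one port."""
    free_at = 0
    latencies = []
    for t in arrival_cycles:
        start = max(t, free_at)      # wait if the port is still busy
        free_at = start + SERVICE
        latencies.append(free_at - t)
    return latencies

# One isolated request sees the unloaded latency:
print(loaded_latencies([0]))        # [10]
# Two simultaneous requests: the second waits behind the first.
print(loaded_latencies([0, 0]))     # [10, 20]
```

So a model that charges every access the atomic (unloaded) number will look
faster than a timing-mode O3 run whenever accesses actually contend.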

Ali
