Re: [gem5-users] In-order faster than O3?

Ali Saidi Wed, 18 May 2011 22:08:19 -0700

Hi Marc,

The atomic latency isn't as accurate as the latency with the memory system in 
timing mode. What is returned is an unloaded latency (one request in the entire 
memory system/no contention at all).
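
As a rough illustration of the difference, here is a toy sketch (not gem5 code; 
the single-channel model, the names, and the numbers are all made up for the 
example). Requests that arrive far apart each see the unloaded latency, which 
is effectively what atomic mode reports for every access. Requests that arrive 
close together queue behind each other, which is the kind of contention that 
timing mode can capture.

```python
# Toy model, NOT gem5 code: a single memory channel serving one request at a time.
# UNLOADED_LATENCY and the arrival times below are made-up illustrative values.

UNLOADED_LATENCY = 10  # cycles for a request when nothing else is in flight


def loaded_latencies(arrivals, service=UNLOADED_LATENCY):
    """Return the latency each request sees when all share one channel."""
    latencies = []
    channel_free_at = 0
    for t in sorted(arrivals):
        start = max(t, channel_free_at)  # wait if the channel is still busy
        finish = start + service
        latencies.append(finish - t)     # queueing delay plus service time
        channel_free_at = finish
    return latencies


# Requests far apart: each one sees only the unloaded latency.
sparse = loaded_latencies([0, 100, 200])
# Requests bunched together: later ones wait behind earlier ones.
bursty = loaded_latencies([0, 1, 2])
print(sparse, bursty)
```

A CPU model that charges the unloaded number for every access will 
underestimate the cost of bursts like the second case.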


Ali

On May 19, 2011, at 12:01 AM, Marc de Kruijf wrote:

> Thanks Ali and Korey.  
> 
> My checkout is about a month old so that could be the issue.  I'll take a 
> look tomorrow.
> MSHR settings are okay.  I'm using the atomic-reported memory latencies 
> (dcache_latency and icache_latency) to compute access latencies in my model.  
> I assume these are as accurate as the timing or O3 CPU latencies for 
> single-threaded workloads.
> 
> If I keep having issues I'm happy to share the CPU model but it's a research 
> prototype configured to do exotic researchy sorts of things so I'm not sure 
> how helpful that would be.  =)
> 
> On Wed, May 18, 2011 at 11:33 PM, Korey Sewell <ksew...@umich.edu> wrote:
> I'd also take a look at how many MSHRs you are giving your caches and see if 
> it matches w/your cpu model. For example, if you only have 2 MSHRs, but your 
> model is issuing up to 8 speculative loads, there's a chance your system is 
> under-provisioned and will eventually lose some performance.
> 
> 
> On Thu, May 19, 2011 at 12:28 AM, Ali Saidi <sa...@umich.edu> wrote:
> Hi Marc,
> 
> If you haven't updated your code recently, I committed some changes last week 
> that fixed some dependency issues with the ARM condition codes in the o3 cpu 
> model. Previously, any instruction that wrote a condition code would have to 
> do a read-modify-write operation on all the condition codes together, meaning 
> that a string of instructions that set condition codes were all dependent on 
> each other. The committed code fixes this issue and sees improvement of up to 
> 22% on some SPEC benchmarks.
> 
> If that doesn't fix the issue, you'll need to see where the o3 model is 
> stalling on your workload. Some of the statistics might help narrow it down a 
> bit. The model should be able to issue dependent instructions in back-to-back 
> cycles, and it executes instructions speculatively (including loads).
> 
> Any chance you'd share your cpu model? Are you sure you're accounting for 
> memory latency correctly in it? The atomic memory mode completes a load/store 
> instantly, so if you're not correctly accounting for the real time it would 
> take for that load/store to complete that could be part of the issue.
> 
> Ali
> 
> On May 18, 2011, at 9:21 PM, Marc de Kruijf wrote:
> 
> > Hi all,
> >
> > I recently extended the atomic CPU model to simulate a deeply-pipelined 
> > two-issue in-order machine.  The code includes variable length instruction 
> > latencies, checks for register dependences, has full bypass/forwarding 
> > capability, and so on.  I have reason to believe it is working as it should.
> >
> > Curiously, when I run binaries using this CPU model, it frequently 
> > outperforms the O3 CPU model in terms of cycle count.  The O3 model I 
> > compare against is also two-issue, has an 8-entry load queue, 8-entry store 
> > queue, 16-entry IQ, 32-entry ROB, extra physical regs, but is otherwise 
> > configured identically.  The in-order core models identical branch 
> > prediction with a rather generous 13-cycle mispredict penalty for the 
> > two-issue core (e.g. as in ARM Cortex-A8), and still achieves better 
> > performance in most cases.
> >
> > I'm finding it hard to parse through all the O3 trace logs, so I was 
> > wondering if anyone has intuition as to why this might be the case.  Does 
> > the O3 CPU not do full bypassing?  Is there speculation going on beyond 
> > just branch prediction?  I plan to look into the source code in more 
> > detail, but I was wondering if someone could give me a leg up by pointing 
> > me in the right direction.
> >
> > I've also noticed when I set the MemRead and MemWrite latencies in 
> > src/cpu/o3/FuncUnitConfig.py to anything greater than 1, O3 performance 
> > slows down quite drastically (~10% per increment).  This doesn't really 
> > make sense to me either.  I'm not configuring with a massive instruction 
> > window, but I wouldn't expect performance to suffer quite so much.  If it 
> > helps, all my simulations so far are just using ARM.
> > _______________________________________________
> > gem5-users mailing list
> > gem5-users@m5sim.org
> > http://m5sim.org/cgi-bin/mailman/listinfo/gem5-users
> 
> 
> 
> -- 
> - Korey
> 
