Hi Marc,

The atomic latency isn't as accurate as the latency with the memory system in 
timing mode. What is returned is an unloaded latency (one request in the entire 
memory system, with no contention at all).
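
To make the unloaded/loaded distinction concrete, here is a rough toy sketch in Python. It is not gem5 code: the function name simulate, the 100-cycle service latency, and the max_outstanding limit (standing in for something like an MSHR count) are all assumptions chosen only for illustration.

```python
import heapq


def simulate(issue_times, service_latency, max_outstanding):
    """Toy memory model (hypothetical, not a gem5 API).

    Returns the latency seen by each request (completion - issue) when at
    most `max_outstanding` requests can be in flight at the same time.
    """
    in_flight = []  # min-heap holding completion times of busy slots
    latencies = []
    for t in sorted(issue_times):
        # Release every slot whose request has completed by time t.
        while in_flight and in_flight[0] <= t:
            heapq.heappop(in_flight)
        if len(in_flight) < max_outstanding:
            start = t  # a slot is free, so service starts immediately
        else:
            # All slots are busy: wait for the earliest one to free up.
            start = heapq.heappop(in_flight)
        done = start + service_latency
        heapq.heappush(in_flight, done)
        latencies.append(done - t)
    return latencies


# One request alone in the system sees only the service latency.
unloaded = simulate([0], service_latency=100, max_outstanding=2)

# Eight requests issued together with two slots: later ones queue up.
loaded = simulate([0] * 8, service_latency=100, max_outstanding=2)
```

With a single request the model reports just the service latency, which is the kind of number the atomic mode hands back. Once several requests compete for a limited number of slots, the later ones see a multiple of that value, and that queueing delay is what an unloaded latency cannot capture.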

Ali

On May 19, 2011, at 12:01 AM, Marc de Kruijf wrote:

> Thanks Ali and Korey.  
> 
> My checkout is about a month old so that could be the issue.  I'll take a 
> look tomorrow.
> MSHR settings are okay.  I'm using the atomic-reported memory latencies 
> (dcache_latency and icache_latency) to compute access latencies in my model.  
> I assume these are as accurate as the timing or O3 CPU latencies for 
> single-threaded workloads.
> 
> If I keep having issues I'm happy to share the CPU model but it's a research 
> prototype configured to do exotic researchy sorts of things so I'm not sure 
> how helpful that would be.  =)
> 
> On Wed, May 18, 2011 at 11:33 PM, Korey Sewell <ksew...@umich.edu> wrote:
> I'd also take a look at how many MSHRs you are giving your caches and see if 
> it matches your CPU model. For example, if you only have 2 MSHRs, but your 
> model is issuing up to 8 speculative loads, there's a chance your system may 
> be under-provisioned and eventually lose some performance.
> 
> 
> On Thu, May 19, 2011 at 12:28 AM, Ali Saidi <sa...@umich.edu> wrote:
> Hi Marc,
> 
> If you haven't updated your code recently, I committed some changes last week 
> that fixed some dependency issues with the ARM condition codes in the o3 cpu 
> model. Previously, any instruction that wrote a condition code had to do a 
> read-modify-write operation on all the condition codes together, meaning 
> that a string of instructions that set condition codes were all dependent on 
> each other. The committed code fixes this issue and shows improvements of up 
> to 22% on some SPEC benchmarks.
> 
> If that doesn't fix the issue, you'll need to see where the o3 model is 
> stalling on your workload. Some of the statistics might help narrow it down a 
> bit. The model should be able to issue dependent instructions in back-to-back 
> cycles, and executes instructions speculatively (including loads).
> 
> Any chance you'd share your cpu model? Are you sure you're accounting for 
> memory latency correctly in it? The atomic memory mode completes a load/store 
> instantly, so if you're not correctly accounting for the real time it would 
> take for that load/store to complete that could be part of the issue.
> 
> Ali
> 
> On May 18, 2011, at 9:21 PM, Marc de Kruijf wrote:
> 
> > Hi all,
> >
> > I recently extended the atomic CPU model to simulate a deeply-pipelined 
> > two-issue in-order machine.  The code includes variable length instruction 
> > latencies, checks for register dependences, has full bypass/forwarding 
> > capability, and so on.  I have reason to believe it is working as it should.
> >
> > Curiously, when I run binaries using this CPU model, it frequently 
> > outperforms the O3 CPU model in terms of cycle count.  The O3 model I 
> > compare against is also two-issue, has an 8-entry load queue, 8-entry store 
> > queue, 16-entry IQ, 32-entry ROB, extra physical regs, but is otherwise 
> > configured identically.  The in-order core models identical branch 
> > prediction with a rather generous 13-cycle mispredict penalty for the 
> > two-issue core (e.g. as in ARM Cortex-A8), and still achieves better 
> > performance in most cases.
> >
> > I'm finding it hard to parse through all the O3 trace logs, so I was 
> > wondering if anyone has intuition as to why this might be the case.  Does 
> > the O3 CPU not do full bypassing?  Is there speculation going on beyond 
> > just branch prediction?  I plan to look into the source code in more 
> > detail, but I was wondering if someone could give me a leg up by pointing 
> > me in the right direction.
> >
> > I've also noticed when I set the MemRead and MemWrite latencies in 
> > src/cpu/o3/FuncUnitConfig.py to anything greater than 1, O3 performance 
> > slows down quite drastically (~10% per increment).  This doesn't really 
> > make sense to me either.  I'm not configuring with a massive instruction 
> > window, but I wouldn't expect performance to suffer quite so much.  If it 
> > helps, all my simulations so far are just using ARM.
> > _______________________________________________
> > gem5-users mailing list
> > gem5-users@m5sim.org
> > http://m5sim.org/cgi-bin/mailman/listinfo/gem5-users
> 
> 
> -- 
> - Korey
> 
