For anyone who might be interested, this has been resolved. There were two causes. First, I had assumed that O3 uses a write buffer to let stores retire early (which implies a memory consistency model weaker than SC). My in-order model has one, but O3 appears to be SC and hence does not. Second, when I increased the MemWrite and MemRead latencies, I did not also increase the wbDepth setting beyond its default of 1, which created a bottleneck whenever several longer-latency operations were in flight at once.
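To make the second point concrete, here is a toy Python model of the bottleneck. It is not gem5 code: it simply assumes that the number of operations in flight between issue and writeback is capped at a fixed number of slots, which is how I understand wbDepth to act. The function name, its parameters, and the numbers in the usage lines are made up for illustration.

```python
def cycles_to_complete(num_ops, issue_width, wb_slots, latency):
    """Cycles for num_ops independent, identical ops to issue and drain.

    Toy model (an assumption, not gem5's implementation): an op holds one
    writeback slot from the cycle it issues until `latency` cycles later.
    """
    in_flight = []  # completion cycle of every outstanding op
    issued = 0
    cycle = 0
    while issued < num_ops or in_flight:
        # Free the slots of ops whose latency has elapsed.
        in_flight = [done for done in in_flight if done > cycle]
        # Issue as many ops as the issue width and the free slots allow.
        free_slots = wb_slots - len(in_flight)
        n = min(issue_width, free_slots, num_ops - issued)
        in_flight.extend([cycle + latency] * n)
        issued += n
        cycle += 1
    return cycle


if __name__ == "__main__":
    # Same two-issue machine and 3-cycle ops, with few vs. more slots.
    print(cycles_to_complete(8, issue_width=2, wb_slots=2, latency=3))
    print(cycles_to_complete(8, issue_width=2, wb_slots=6, latency=3))
```

In this model the slot count stops mattering once it reaches roughly issue_width times latency; below that, every extra cycle of latency directly throttles issue, which is consistent with the per-increment slowdown I reported.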
On Thu, May 19, 2011 at 12:07 AM, Ali Saidi <sa...@umich.edu> wrote:

> Hi Marc,
>
> The atomic latency isn't as accurate as the latency with the memory system
> in timing mode. What is returned is an unloaded latency (one request in the
> entire memory system/no contention at all).
>
> Ali
>
> On May 19, 2011, at 12:01 AM, Marc de Kruijf wrote:
>
> Thanks Ali and Korey.
>
> My checkout is about a month old so that could be the issue. I'll take a
> look tomorrow.
> MSHR settings are okay. I'm using the atomic-reported memory latencies
> (dcache_latency and icache_latency) to compute access latencies in my model.
> I assume these are as accurate as the timing or O3 CPU latencies for
> single-threaded workloads.
>
> If I keep having issues I'm happy to share the CPU model but it's a
> research prototype configured to do exotic researchy sorts of things so I'm
> not sure how helpful that would be. =)
>
> On Wed, May 18, 2011 at 11:33 PM, Korey Sewell <ksew...@umich.edu> wrote:
>
>> I'd also take a look at how many MSHRs you are giving your caches and see
>> if it matches with your CPU model. For example, if you only have 2 MSHRs
>> but your model is issuing up to 8 speculative loads, there's a chance your
>> system is under-provisioned and will eventually lose some performance.
>>
>> On Thu, May 19, 2011 at 12:28 AM, Ali Saidi <sa...@umich.edu> wrote:
>>
>>> Hi Marc,
>>>
>>> If you haven't updated your code recently, I committed some changes last
>>> week that fixed some dependency issues with the ARM condition codes in
>>> the o3 cpu model. Previously any instruction that wrote a condition code
>>> would have to do a read-modify-write operation on all the condition
>>> codes together, meaning that a string of instructions that set condition
>>> codes were all dependent on each other. The committed code fixes this
>>> issue and sees improvement of up to 22% on some SPEC benchmarks.
>>>
>>> If that doesn't fix the issue, you'll need to see where the o3 model is
>>> stalling on your workload. Some of the statistics might help narrow it
>>> down a bit. The model should be able to issue dependent instructions in
>>> back-to-back cycles, and executes instructions speculatively (including
>>> loads).
>>>
>>> Any chance you'd share your cpu model? Are you sure you're accounting for
>>> memory latency correctly in it? The atomic memory mode completes a
>>> load/store instantly, so if you're not correctly accounting for the real
>>> time it would take for that load/store to complete, that could be part
>>> of the issue.
>>>
>>> Ali
>>>
>>> On May 18, 2011, at 9:21 PM, Marc de Kruijf wrote:
>>>
>>> > Hi all,
>>> >
>>> > I recently extended the atomic CPU model to simulate a deeply-pipelined
>>> > two-issue in-order machine. The code includes variable length
>>> > instruction latencies, checks for register dependences, has full
>>> > bypass/forwarding capability, and so on. I have reason to believe it
>>> > is working as it should.
>>> >
>>> > Curiously, when I run binaries using this CPU model, it frequently
>>> > outperforms the O3 CPU model in terms of cycle count. The O3 model I
>>> > compare against is also two-issue, has an 8-entry load queue, 8-entry
>>> > store queue, 16-entry IQ, 32-entry ROB, extra physical regs, but is
>>> > otherwise configured identically. The in-order core models identical
>>> > branch prediction with a rather generous 13-cycle mispredict penalty
>>> > for the two-issue core (e.g. as in ARM Cortex-A8), and still achieves
>>> > better performance in most cases.
>>> >
>>> > I'm finding it hard to parse through all the O3 trace logs, so I was
>>> > wondering if anyone has intuition as to why this might be the case.
>>> > Does the O3 CPU not do full bypassing? Is there speculation going on
>>> > beyond just branch prediction? I plan to look into the source code in
>>> > more detail, but I was wondering if someone could give me a leg up by
>>> > pointing me in the right direction.
>>> >
>>> > I've also noticed when I set the MemRead and MemWrite latencies in
>>> > src/cpu/o3/FuncUnitConfig.py to anything greater than 1, O3
>>> > performance slows down quite drastically (~10% per increment). This
>>> > doesn't really make sense to me either. I'm not configuring with a
>>> > massive instruction window, but I wouldn't expect performance to
>>> > suffer quite so much. If it helps, all my simulations so far are just
>>> > using ARM.
>>> > _______________________________________________
>>> > gem5-users mailing list
>>> > gem5-users@m5sim.org
>>> > http://m5sim.org/cgi-bin/mailman/listinfo/gem5-users
>>
>> --
>> - Korey
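To put rough numbers on Korey's MSHR point and Ali's remark about unloaded latency in the thread above, here is a back-of-the-envelope sketch. It assumes the misses are independent and issued in the same cycle, that each one takes exactly the unloaded latency, and that a free MSHR is the only constraint. The function name and the 100-cycle latency are hypothetical, chosen only for illustration.

```python
import math


def batch_miss_cycles(num_misses, num_mshrs, unloaded_latency):
    """Cycles to service a batch of independent misses issued together.

    Assumption: misses are serviced in waves of at most num_mshrs, and
    each wave takes exactly the unloaded latency.
    """
    waves = math.ceil(num_misses / num_mshrs)
    return waves * unloaded_latency


if __name__ == "__main__":
    # 8 speculative loads that all miss, with 2 MSHRs vs. 8 MSHRs.
    print(batch_miss_cycles(8, num_mshrs=2, unloaded_latency=100))
    print(batch_miss_cycles(8, num_mshrs=8, unloaded_latency=100))
```

Under these assumptions a model that charges every load only the unloaded latency matches the fully provisioned case, and is optimistic whenever more misses are outstanding than there are MSHRs.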