For anyone who might be interested, this issue has been resolved.

There were two issues. First, I had assumed O3 used a write buffer to allow
stores to retire early (implying non-SC memory consistency). I model one in
my in-order processor, but O3 appears to be sequentially consistent and hence
does not have one. Second, when I increased the MemWrite and MemRead
latencies, I did not also increase the wbDepth setting beyond its default of
1, which created a bottleneck when multiple longer-latency operations were
executing simultaneously.
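
For anyone hitting the same thing, the relevant part of the change in my
configuration looked roughly like the sketch below. Treat it as illustrative
rather than authoritative: the value 4 is just an example sized to the memory
latencies I was using, so check the parameter against your own checkout.

    from m5.objects import *

    cpu = DerivO3CPU()
    # The writeback buffering defaults to a depth of 1.  Once MemRead/MemWrite
    # take more than one cycle, several in-flight long-latency ops compete for
    # that single slot, so scale the depth with the longest memory-op latency.
    cpu.wbDepth = 4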


On Thu, May 19, 2011 at 12:07 AM, Ali Saidi <sa...@umich.edu> wrote:

> Hi Marc,
>
> The atomic latency isn't as accurate as the latency with the memory system
> in timing mode. What is returned is an unloaded latency (a single request in
> the entire memory system, with no contention at all).
>
> Ali
>
> On May 19, 2011, at 12:01 AM, Marc de Kruijf wrote:
>
> Thanks Ali and Korey.
>
> My checkout is about a month old so that could be the issue.  I'll take a
> look tomorrow.
> MSHR settings are okay.  I'm using the atomic-reported memory latencies
> (dcache_latency and icache_latency) to compute access latencies in my model.
>  I assume these are as accurate as the timing or O3 CPU latencies for
> single-threaded workloads.
>
> If I keep having issues, I'm happy to share the CPU model, but it's a
> research prototype configured to do exotic, researchy sorts of things, so
> I'm not sure how helpful that would be.  =)
>
> On Wed, May 18, 2011 at 11:33 PM, Korey Sewell <ksew...@umich.edu> wrote:
>
>> I'd also take a look at how many MSHRs you are giving your caches and see
>> if it matches your CPU model. For example, if you only have 2 MSHRs but
>> your model is issuing up to 8 speculative loads, there's a chance your
>> system is under-provisioned and will eventually lose some performance.
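>>
>> As a concrete illustration (just a sketch -- the exact BaseCache parameter
>> names and values may differ in your tree), the MSHR count is a parameter on
>> the cache itself, e.g.:
>>
>>     from m5.objects import BaseCache
>>
>>     class L1DCache(BaseCache):
>>         size = '32kB'
>>         assoc = 2
>>         block_size = 64
>>         latency = '1ns'
>>         mshrs = 8            # >= max outstanding misses your CPU can issue
>>         tgts_per_mshr = 16   # accesses that can coalesce onto one miss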
>>
>>
>> On Thu, May 19, 2011 at 12:28 AM, Ali Saidi <sa...@umich.edu> wrote:
>>
>>> Hi Marc,
>>>
>>> If you haven't updated your code recently, I committed some changes last
>>> week that fixed some dependency issues with the ARM condition codes in the
>>> o3 cpu model. Previously, any instruction that wrote a condition code had
>>> to do a read-modify-write operation on all the condition codes together,
>>> meaning that a string of instructions that set condition codes were all
>>> dependent on each other. The committed code fixes this issue and sees
>>> improvements of up to 22% on some SPEC benchmarks.
>>>
>>> If that doesn't fix the issue, you'll need to see where the o3 model is
>>> stalling on your workload. Some of the statistics might help narrow it down
>>> a bit. The model should be able to issue dependent instructions in
>>> back-to-back cycles, and it executes instructions speculatively (including
>>> loads).
>>>
>>> Any chance you'd share your cpu model? Are you sure you're accounting for
>>> memory latency correctly in it? The atomic memory mode completes a
>>> load/store instantly, so if you're not correctly accounting for the real
>>> time it would take for that load/store to complete, that could be part of
>>> the issue.
>>>
>>> Ali
>>>
>>> On May 18, 2011, at 9:21 PM, Marc de Kruijf wrote:
>>>
>>> > Hi all,
>>> >
>>> > I recently extended the atomic CPU model to simulate a deeply-pipelined
>>> two-issue in-order machine.  The code includes variable instruction
>>> latencies, checks for register dependences, has full bypass/forwarding
>>> capability, and so on.  I have reason to believe it is working as it should.
>>> >
>>> > Curiously, when I run binaries using this CPU model, it frequently
>>> outperforms the O3 CPU model in terms of cycle count.  The O3 model I
>>> compare against is also two-issue, with an 8-entry load queue, an 8-entry
>>> store queue, a 16-entry IQ, a 32-entry ROB, and extra physical regs, but is otherwise
>>> configured identically.  The in-order core models identical branch
>>> prediction with a rather generous 13-cycle mispredict penalty for the
>>> two-issue core (e.g. as in ARM Cortex-A8), and still achieves better
>>> performance in most cases.
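>>> >
>>> (For concreteness, the O3 side of the comparison is set up roughly as
>>> follows -- a sketch using what I believe are the DerivO3CPU parameter
>>> names, so double-check them against your tree:
>>>
>>>     from m5.objects import *
>>>
>>>     cpu = DerivO3CPU()
>>>     cpu.fetchWidth = 2
>>>     cpu.decodeWidth = 2
>>>     cpu.issueWidth = 2
>>>     cpu.commitWidth = 2
>>>     cpu.LQEntries = 8       # load queue
>>>     cpu.SQEntries = 8       # store queue
>>>     cpu.numIQEntries = 16   # issue queue
>>>     cpu.numROBEntries = 32  # reorder buffer
>>>
>>> plus the extra physical registers mentioned above.)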
>>> >
>>> > I'm finding it hard to parse through all the O3 trace logs, so I was
>>> wondering if anyone has intuition as to why this might be the case.  Does
>>> the O3 CPU not do full bypassing?  Is there speculation going on beyond just
>>> branch prediction?  I plan to look into the source code in more detail, but
>>> I was wondering if someone could give me a leg up by pointing me in the
>>> right direction.
>>> >
>>> > I've also noticed that when I set the MemRead and MemWrite latencies in
>>> src/cpu/o3/FuncUnitConfig.py to anything greater than 1, O3 performance
>>> slows down quite drastically (~10% per increment).  This doesn't really make
>>> sense to me either.  I'm not configuring with a massive instruction window,
>>> but I wouldn't expect performance to suffer quite so much.  If it helps, all
>>> my simulations so far are just using ARM.
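>>> >
>>> (The edit itself is small -- inside src/cpu/o3/FuncUnitConfig.py I'm
>>> bumping opLat on the memory OpDescs, roughly like this; the class name and
>>> count are whatever your checkout uses:
>>>
>>>     class RdWrPort(FUDesc):
>>>         opList = [ OpDesc(opClass='MemRead', opLat=2),
>>>                    OpDesc(opClass='MemWrite', opLat=2) ]
>>>         count = 4
>>>
>>> and that change alone is enough to trigger the slowdown.)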
>>>
>>
>>
>>
>> --
>> - Korey
>>
>>
>
_______________________________________________
gem5-users mailing list
gem5-users@m5sim.org
http://m5sim.org/cgi-bin/mailman/listinfo/gem5-users
