Hi all,

I recently extended the atomic CPU model to simulate a deeply pipelined,
two-issue, in-order machine.  The model handles variable instruction
latencies, checks for register dependences, implements full
bypassing/forwarding, and so on.  I have reason to believe it is working as
it should.
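Roughly speaking, the dependence/forwarding check works like the simplified
sketch below (illustrative Python only, not the actual model code; the class
and field names are made up for the example, and inst is assumed to carry
srcs, dests, and latency fields):

    class Scoreboard:
        def __init__(self):
            # Register index -> earliest cycle the value can be bypassed.
            self.ready_cycle = {}

        def can_issue(self, inst, cur_cycle):
            # Issue only if every source is written back or forwardable
            # this cycle; with full bypassing that is the producer's issue
            # cycle plus its latency, not its writeback cycle.
            return all(self.ready_cycle.get(r, 0) <= cur_cycle
                       for r in inst.srcs)

        def issue(self, inst, cur_cycle):
            # Record when each destination becomes available to
            # dependents on the bypass network.
            for r in inst.dests:
                self.ready_cycle[r] = cur_cycle + inst.latency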

Curiously, when I run binaries using this CPU model, it frequently
outperforms the O3 CPU model in terms of cycle count.  The O3 model I
compare against is also two-issue, with an 8-entry load queue, an 8-entry
store queue, a 16-entry IQ, a 32-entry ROB, and extra physical registers,
but is otherwise configured identically.  The in-order core models
identical branch prediction with a rather generous 13-cycle mispredict
penalty for the two-issue core (as in the ARM Cortex-A8, for example), and
still achieves better performance in most cases.
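For reference, the O3 configuration boils down to something like the snippet
below (parameter names as they appear in the O3CPU Python parameters of my
gem5 checkout; they may differ in other versions):

    cpu = DerivO3CPU()
    cpu.fetchWidth    = 2
    cpu.decodeWidth   = 2
    cpu.issueWidth    = 2
    cpu.commitWidth   = 2
    cpu.LQEntries     = 8    # load queue entries
    cpu.SQEntries     = 8    # store queue entries
    cpu.numIQEntries  = 16   # instruction queue entries
    cpu.numROBEntries = 32   # reorder buffer entries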

I'm finding it hard to sift through all the O3 trace logs, so I was
wondering if anyone has intuition as to why this might be the case.  Does
the O3 CPU not do full bypassing?  Is there speculation going on beyond just
branch prediction?  I plan to look into the source code in more detail, but
I was wondering if someone could give me a leg up by pointing me in the
right direction.

I've also noticed that when I set the MemRead and MemWrite latencies in
src/cpu/o3/FuncUnitConfig.py to anything greater than 1, O3 performance
drops quite drastically (roughly 10% per increment).  This doesn't really
make sense to me either.  I'm not configuring a massive instruction window,
but I wouldn't expect performance to suffer quite so much.  If it helps, all
my simulations so far use ARM.
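For clarity, the change I'm making looks roughly like this (the class is
from my copy of src/cpu/o3/FuncUnitConfig.py; class names, counts, and
defaults may differ in other versions, and opLat=2 is just the first
increment I tried):

    class RdWrPort(FUDesc):
        opList = [ OpDesc(opClass='MemRead',  opLat=2),   # was 1
                   OpDesc(opClass='MemWrite', opLat=2) ]  # was 1
        count = 4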