Hi all, I recently extended the atomic CPU model to simulate a deeply-pipelined, two-issue in-order machine. The code supports variable instruction latencies, checks for register dependences, has full bypass/forwarding capability, and so on. I have reason to believe it is working as it should.
Curiously, when I run binaries on this CPU model, it frequently outperforms the O3 CPU model in terms of cycle count. The O3 model I compare against is also two-issue, with an 8-entry load queue, an 8-entry store queue, a 16-entry IQ, a 32-entry ROB, and extra physical registers, but is otherwise configured identically. The in-order core models identical branch prediction with a rather generous 13-cycle mispredict penalty for a two-issue core (e.g., as in the ARM Cortex-A8), and still achieves better performance in most cases.

I'm finding it hard to work through all the O3 trace logs, so I was wondering if anyone has intuition as to why this might be the case. Does the O3 CPU not do full bypassing? Is there speculation going on beyond branch prediction? I plan to look into the source code in more detail, but I was hoping someone could give me a leg up by pointing me in the right direction.

I've also noticed that when I set the MemRead and MemWrite latencies in src/cpu/o3/FuncUnitConfig.py to anything greater than 1, O3 performance drops quite drastically (~10% per increment). This doesn't really make sense to me either. I'm not configuring a massive instruction window, but I wouldn't expect performance to suffer quite so much. If it helps, all my simulations so far use ARM binaries.
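For reference, here's roughly how I'm setting up the O3 side in my config script. This is just a sketch; the parameter names are the ones I believe DerivO3CPU exposes, and the FuncUnitConfig.py fragment shows the kind of opLat change I mean (exact names/defaults may differ in your gem5 version):

```python
# Sketch of the O3 configuration described above (gem5 Python config).
# Parameter names assumed from DerivO3CPU; check your gem5 version.
from m5.objects import DerivO3CPU

cpu = DerivO3CPU()
cpu.fetchWidth = 2         # two-issue machine, front to back
cpu.decodeWidth = 2
cpu.issueWidth = 2
cpu.commitWidth = 2
cpu.LQEntries = 8          # 8-entry load queue
cpu.SQEntries = 8          # 8-entry store queue
cpu.numIQEntries = 16      # 16-entry instruction queue
cpu.numROBEntries = 32     # 32-entry reorder buffer

# And the memory-latency change, made directly in
# src/cpu/o3/FuncUnitConfig.py -- bumping opLat on the memory ports:
#
#   class ReadPort(FUDesc):
#       opList = [ OpDesc(opClass='MemRead', opLat=2) ]   # was opLat=1
```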
_______________________________________________ gem5-users mailing list gem5-users@m5sim.org http://m5sim.org/cgi-bin/mailman/listinfo/gem5-users