I ran the O3 CPU in FS mode on x86 with a simple microbenchmark and got a much lower IPC than the theoretical maximum. The cause seems to be data dependencies through the (control) flags rather than through registers, and I am wondering whether anyone has come across the same issue.
The microbenchmark has many data-independent ADD instructions (http://repo.gem5.org/gem5/file/570b44fe6e04/src/arch/x86/isa/insts/general_purpose/arithmetic/add_and_subtract.py#l41) in a loop. On a 2-wide out-of-order machine with enough resources, the IPC should be two at steady state. However, the IPC only goes up to one.

What is happening is that, even though an x86 ADD has two source registers, one destination register, and a flag to set, gem5 adds one extra flag source register to each ADD. As a result, each ADD becomes dependent on the earlier ADD's destination flag, which constrains the achievable IPC to one. Here is an example sequence with physical register mappings:

ADD: S1=98, S2=9, S3=2, D1=82, D2=105 (flag)
ADD: S1=92, S2=9, S3=105 (flag), D1=79, D2=90
...

Physical registers 98, 9, and 92 are ready when those two ADDs are renamed; however, as you can see, the second ADD has to wait for the first ADD because of the extra flag source register S3 (physical register 105, which is the first ADD's flag destination D2).

When I removed those flags in the macroop definition, the IPC jumped from 1 to 1.7.

Does anyone know why the ADD has to read the flags, even though the x86 manual does not say that it does? Those flags should just cause a write-after-write dependency, not a read-after-write dependency.

Yasuko
_______________________________________________
gem5-dev mailing list
http://m5sim.org/mailman/listinfo/gem5-dev
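The effect described above can be reproduced with a toy scheduling model. This is only a sketch and not gem5 code: the function name `steady_state_ipc` and its parameters are hypothetical, every ADD is assumed to have a one-cycle latency, the instruction window is assumed to be unlimited, and the loop branch, fetch, and commit limits are ignored (which is presumably part of why the measured IPC was 1.7 rather than 2).

```python
def steady_state_ipc(num_adds, width, flag_is_source, latency=1):
    """Toy model: issue `num_adds` ADDs on a `width`-wide OoO machine.

    Register operands are assumed to be always ready. If `flag_is_source`
    is True, each ADD also reads the flag written by the previous ADD,
    which creates a read-after-write chain through the flags.
    """
    issued_per_cycle = {}  # cycle -> number of ADDs issued in that cycle
    finish = []            # finish[i] = cycle in which ADD i's results are ready

    for i in range(num_adds):
        # Earliest cycle in which all sources are available.
        ready = finish[i - 1] if (flag_is_source and i > 0) else 0

        # Take the first cycle at or after `ready` with a free issue slot.
        cycle = ready
        while issued_per_cycle.get(cycle, 0) >= width:
            cycle += 1
        issued_per_cycle[cycle] = issued_per_cycle.get(cycle, 0) + 1
        finish.append(cycle + latency)

    total_cycles = max(finish)
    return num_adds / total_cycles


if __name__ == "__main__":
    # Flags treated as a source: serialized chain of ADDs.
    print(steady_state_ipc(1000, width=2, flag_is_source=True))
    # Flags treated as write-only: both issue slots are used every cycle.
    print(steady_state_ipc(1000, width=2, flag_is_source=False))
```

Under these assumptions, the flag chain alone is enough to cap a 2-wide machine at one ADD per cycle, regardless of how many physical registers or issue slots are available.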
