Hello fellow gem5 users,
I recently stumbled into an interesting behavior in a program that uses
a lot of store-pair/load-pair instructions (which if I understand
correctly act as the push and pop of AArch64).
What happens is that store-pair (of two 64-bit registers) is implemented
as two 64-bit store micro-ops. Conversely, load-pair is implemented as a
single 128-bit micro-op. As a result, it is not possible for load-pair
to benefit from store-to-load forwarding from a corresponding store pair
instruction, because an instruction cannot get its data from two
different (although contiguous in that case) SQ entries. So I ended up
with a lot of "rescheduledLoads". Those loads wait for the store they
partially depend on (generally one of the two store-pair micro-ops) to
write back before issuing again.
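
For illustration, the forwarding restriction described above can be
sketched as a toy model in Python. This is a simplified sketch under
stated assumptions, not gem5's LSQ code: StoreEntry and classify_load
are hypothetical names, and the only rule modelled is that a load can
forward only if a single SQ entry covers all of its bytes.

```python
from dataclasses import dataclass


@dataclass
class StoreEntry:
    """Hypothetical SQ entry: start address and size in bytes."""
    addr: int
    size: int


def classify_load(load_addr, load_size, store_queue):
    """Classify a load against older stores (list ordered oldest to youngest).

    Returns "forward" if the youngest overlapping store covers every byte
    of the load, "partial" if it only overlaps some of them (the load has
    to be rescheduled), and "none" if no store overlaps the load.
    """
    load_end = load_addr + load_size
    for st in reversed(store_queue):  # search youngest store first
        st_end = st.addr + st.size
        if st.addr <= load_addr and load_end <= st_end:
            return "forward"
        if st.addr < load_end and load_addr < st_end:
            return "partial"
    return "none"


# Store-pair cracked into two 64-bit micro-ops, load-pair as one 128-bit op.
split_stp = [StoreEntry(0x1000, 8), StoreEntry(0x1008, 8)]
print(classify_load(0x1000, 16, split_stp))   # partial

# Store-pair kept as a single 128-bit micro-op.
single_stp = [StoreEntry(0x1000, 16)]
print(classify_load(0x1000, 16, single_stp))  # forward
```

In this model the 128-bit load always hits a partial match against the
second store micro-op, which corresponds to the rescheduled case.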
While this appears mostly inconsequential (I considered the olden suite,
I know it is really old, but it finishes in hours so it's nice for
debugging), one of the programs (bh) has a lot of "rescheduledLoads"
(231M for 5.4B committed microops). As a result, I tried modifying how
load/store-pair are cracked, either one micro-op for each, or two
micro-ops for each. On olden, the former has a speedup of 1.02 (geomean)
while the latter has 0.989 (geomean) on the arm_detailed CPU. So in
general IPC does not change, but on bh, the former gives a speedup of
11% while the latter gives 2%. I reckon that the difference might grow
higher on a more aggressive CPU, since some of the performance loss due
to STLF stall may be hidden by resource contention (ROB/IQ full) in
arm_detailed.
If the problem is really the inability to do STLF, then using two
microops for both instructions should help, because a load-pair will be
allowed to match other regular 64-bit stores in addition to store
micro-ops of store pairs. On the other hand, more microops will enter
the pipe. However, the two micro-op solution does not improve
performance as much in bh as the single micro-op one, so it appears
that having fewer micro-ops is more important than having fewer
rescheduledLoads. The only issue is that the single micro-op solution
assumes that the SQ can hold 128 bits per entry (which might explain the
initial asymmetric implementation).
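
The trade-off between the three cracking schemes can be sketched with a
small Python model. It is only an illustration under assumptions: crack,
rescheduled and evaluate are hypothetical helpers, only the memory
micro-ops of one store-pair/load-pair to the same 16 bytes are counted,
and a load is forwarded only when one store covers it entirely.

```python
def crack(base, n_uops, total=16):
    """Split a `total`-byte access at `base` into n_uops (addr, size) pieces."""
    size = total // n_uops
    return [(base + i * size, size) for i in range(n_uops)]


def rescheduled(stores, loads):
    """Count loads whose youngest overlapping store only partially covers them."""
    n = 0
    for la, ls in loads:
        for sa, ss in reversed(stores):      # youngest store first
            if sa <= la and la + ls <= sa + ss:
                break                        # fully covered: forwarded
            if sa < la + ls and la < sa + ss:
                n += 1                       # partial overlap: load must wait
                break
    return n


def evaluate(stp_uops, ldp_uops, base=0x1000):
    """Return (memory micro-ops, rescheduled loads) for one stp/ldp pair."""
    stores = crack(base, stp_uops)
    loads = crack(base, ldp_uops)
    return stp_uops + ldp_uops, rescheduled(stores, loads)


print(evaluate(2, 1))  # current asymmetric scheme: (3, 1)
print(evaluate(1, 1))  # one micro-op for each:     (2, 0)
print(evaluate(2, 2))  # two micro-ops for each:    (4, 0)
```

Two regular 64-bit stores to adjacent addresses look the same as the
two-micro-op store-pair in this model, so a single 128-bit load-pair
still cannot forward from them, while a load-pair cracked into two
64-bit micro-ops can.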
So the first question is whether anyone can provide insight into the
asymmetric design, and the second is whether it sounds reasonable to
adopt the single micro-op solution (although for completeness, any
128-bit store currently cracked into two micro-ops should be
reimplemented).
Cheers,
Arthur Perais.
--
Arthur Perais
INRIA Bretagne Atlantique
Bâtiment 12E, Bureau E303, Campus de Beaulieu
35042 Rennes, France