Hello fellow gem5 users,

I recently ran into an interesting behavior in a program that uses a lot of store-pair/load-pair instructions (which, if I understand correctly, act as the push and pop of AArch64). A store-pair of two 64-bit registers is implemented as two 64-bit store micro-ops, whereas a load-pair is implemented as a single 128-bit micro-op. As a result, a load-pair cannot benefit from store-to-load forwarding (STLF) from the corresponding store-pair, because a load cannot get its data from two different SQ entries, even when they are contiguous as they are here. So I ended up with a lot of "rescheduledLoads": loads that wait for the store they partially depend on (generally one of the two store-pair micro-ops) to write back before issuing again.
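
To make the forwarding restriction concrete, here is a toy Python sketch of the rule as I understand it. It is not gem5's actual LSQ code: the names (Access, lookup) and the addresses are made up for illustration, and the only assumption is that a load can forward from a store queue entry only if that single entry covers all of the load's bytes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Access:
    addr: int  # starting byte address
    size: int  # size in bytes

    @property
    def end(self) -> int:
        return self.addr + self.size


def overlaps(load: Access, store: Access) -> bool:
    """True if the two byte ranges share at least one byte."""
    return load.addr < store.end and store.addr < load.end


def covers(store: Access, load: Access) -> bool:
    """True if the store's bytes include every byte the load needs."""
    return store.addr <= load.addr and load.end <= store.end


def lookup(load: Access, store_queue: list) -> str:
    """Search the SQ youngest-first (assumed policy, hypothetical model)."""
    for store in reversed(store_queue):
        if covers(store, load):
            return "forward"      # one entry supplies all the data
        if overlaps(load, store):
            return "reschedule"   # partial match: wait for the store to write back
    return "cache"                # no dependence on any in-flight store


# A store-pair cracked into two 64-bit store micro-ops (illustrative addresses).
sq = [Access(0x1000, 8), Access(0x1008, 8)]

result_pair = lookup(Access(0x1000, 16), sq)  # load-pair as one 128-bit micro-op
result_half = lookup(Access(0x1008, 8), sq)   # one 64-bit load micro-op

print(result_pair, result_half)
```

In this model the 128-bit load only partially matches each 64-bit entry and is rescheduled, while a 64-bit load to either half forwards normally.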

While this appears mostly inconsequential (I used the Olden suite; I know it is really old, but it finishes in hours, so it is nice for debugging), one of the programs (bh) has a lot of "rescheduledLoads": 231M for 5.4B committed micro-ops, i.e. about 4.3%. I therefore tried changing how load-pair and store-pair are cracked, using either one micro-op for each or two micro-ops for each. On Olden, on the arm_detailed CPU, the former gives a geomean speedup of 1.02 and the latter 0.989 (a slight slowdown). So in general IPC barely changes, but on bh the former gives a speedup of 11% while the latter gives 2%. I reckon the difference might grow on a more aggressive CPU, since part of the performance loss due to STLF stalls may be hidden by resource contention (ROB/IQ full) in arm_detailed.

If the problem really is the inability to do STLF, then using two micro-ops for both instructions should help, because each half of a load-pair can then match regular 64-bit stores in addition to the store micro-ops of store-pairs. On the other hand, more micro-ops enter the pipeline. However, on bh the two-micro-op solution does not improve performance as much as the single-micro-op one, so it appears that having fewer micro-ops matters more than having fewer rescheduledLoads. The only issue is that the single-micro-op solution assumes that an SQ entry can hold 128 bits (which might explain the original asymmetric implementation).
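
The trade-off between the three cracking schemes can be sketched with a small, purely illustrative Python model. It is not gem5 code and says nothing about timing: the helper names (crack, can_forward, evaluate) are hypothetical, and it only counts micro-ops and checks whether each load micro-op is fully covered by a single store entry.

```python
def crack(addr, total_bytes, n_uops):
    """Split one pair access into n_uops equal (addr, size) micro-ops."""
    size = total_bytes // n_uops
    return [(addr + i * size, size) for i in range(n_uops)]


def can_forward(load, stores):
    """True if a single store entry covers all of the load's bytes."""
    l_addr, l_size = load
    return any(
        s_addr <= l_addr and l_addr + l_size <= s_addr + s_size
        for s_addr, s_size in stores
    )


def evaluate(store_uops, load_uops, addr=0x1000, nbytes=16):
    """Count micro-ops and non-forwardable loads for one store-pair/load-pair."""
    stores = crack(addr, nbytes, store_uops)
    loads = crack(addr, nbytes, load_uops)
    rescheduled = sum(not can_forward(load, stores) for load in loads)
    return {"uops": len(stores) + len(loads), "rescheduled": rescheduled}


schemes = {
    "current (2 store uops, 1 load uop)": evaluate(2, 1),
    "one micro-op each": evaluate(1, 1),
    "two micro-ops each": evaluate(2, 2),
}

for name, result in schemes.items():
    print(f"{name}: {result['uops']} micro-ops, "
          f"{result['rescheduled']} rescheduled loads")

# A regular 64-bit store can feed one half of a two-micro-op load-pair,
# but not a single 128-bit load micro-op.
plain_store = [(0x1000, 8)]
half_ok = can_forward((0x1000, 8), plain_store)
full_ok = can_forward((0x1000, 16), plain_store)
```

Both symmetric schemes remove the rescheduled load in this model, but the two-micro-op variant pays for it with four micro-ops instead of two, which is consistent with what I see on bh.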

So my first question is whether anyone can provide insight into the asymmetric design, and the second is whether it sounds reasonable to adopt the single-micro-op solution (although, for completeness, any 128-bit store currently cracked into two micro-ops should then be reimplemented as well).

Cheers,

Arthur Perais.


--
Arthur Perais
INRIA Bretagne Atlantique
Bâtiment 12E, Bureau E303, Campus de Beaulieu
35042 Rennes, France

