[gem5-users] arm, aarch64: asymmetry between load-pair and store-pair implementation

Arthur Perais Fri, 15 Jan 2016 01:59:53 -0800

Hello fellow gem5 users,

I recently stumbled into an interesting behavior in a program that uses a lot of store-pair/load-pair instructions (which, if I understand correctly, act as the push and pop of AArch64). What happens is that store-pair (of two 64-bit registers) is implemented as two 64-bit store micro-ops. Conversely, load-pair is implemented as a single 128-bit micro-op. As a result, it is not possible for load-pair to benefit from store-to-load forwarding (STLF) from a corresponding store-pair instruction, because an instruction cannot get its data from two different (although contiguous in that case) SQ entries. So I ended up with a lot of "rescheduledLoads". Those loads wait for the store they partially depend on (generally one of the two store-pair micro-ops) to write back before issuing again.
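To illustrate the forwarding restriction described above, here is a minimal sketch of a store-queue lookup. It is a simplified model, not gem5's actual LSQ code: the names `Access`, `covers`, `overlaps` and `check_load` are hypothetical, and the only rule modelled is that a load can take its data from a single SQ entry that fully covers it.

```python
from dataclasses import dataclass


@dataclass
class Access:
    addr: int  # byte address of the access
    size: int  # access size in bytes


def covers(store: Access, load: Access) -> bool:
    """True if the store's bytes fully contain the load's bytes."""
    return (store.addr <= load.addr
            and load.addr + load.size <= store.addr + store.size)


def overlaps(store: Access, load: Access) -> bool:
    """True if the store and the load share at least one byte."""
    return (store.addr < load.addr + load.size
            and load.addr < store.addr + store.size)


def check_load(store_queue: list, load: Access) -> str:
    """Search older stores, youngest first (assumed model, not gem5 code).

    Returns "forward" if one SQ entry supplies all the bytes,
    "reschedule" if an entry only partially overlaps the load,
    and "cache" if no in-flight store touches the load.
    """
    for store in reversed(store_queue):
        if covers(store, load):
            return "forward"
        if overlaps(store, load):
            return "reschedule"
    return "cache"


# Store-pair cracked into two 64-bit store micro-ops (the current behavior).
stp_two_uops = [Access(0x1000, 8), Access(0x1008, 8)]

# Load-pair as a single 128-bit micro-op: no single entry covers it.
print(check_load(stp_two_uops, Access(0x1000, 16)))  # reschedule

# Load-pair cracked into two 64-bit micro-ops: each half finds its store.
print(check_load(stp_two_uops, Access(0x1000, 8)))   # forward
print(check_load(stp_two_uops, Access(0x1008, 8)))   # forward
```

In this model the 128-bit load overlaps the youngest 64-bit store micro-op without being covered by it, which corresponds to the "rescheduledLoads" case.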

While this appears mostly inconsequential (I considered the Olden suite; I know it is really old, but it finishes in hours so it's nice for debugging), one of the programs (bh) has a lot of "rescheduledLoads" (231M for 5.4B committed micro-ops). As a result, I tried modifying how load/store-pair are cracked: either one micro-op for each, or two micro-ops for each. On Olden, the former has a speedup of 1.02 (geomean) while the latter has 0.989 (geomean) on the arm_detailed CPU. So in general IPC does not change, but on bh, the former gives a speedup of 11% while the latter gives 2%. I reckon that the difference might grow larger on a more aggressive CPU, since some of the performance loss due to STLF stalls may be hidden by resource contention (ROB/IQ full) in arm_detailed.
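For a sense of scale, the figures quoted above can be turned into ratios with a few lines of arithmetic. The variable names are illustrative only, and the inputs are the rounded numbers from this message.

```python
# Rounded figures quoted in the message for bh.
rescheduled_loads = 231e6
committed_uops = 5.4e9

# Fraction of committed micro-ops that correspond to a rescheduled load.
fraction = rescheduled_loads / committed_uops
print(f"{fraction:.1%}")  # 4.3%

# Geomean speedups on Olden, expressed as a percentage change in performance.
for label, speedup in [("one micro-op each", 1.02), ("two micro-ops each", 0.989)]:
    print(f"{label}: {(speedup - 1) * 100:+.1f}%")
```

So on bh roughly one committed micro-op in 23 is associated with a rescheduled load, while the suite-wide geomean moves by about +2% and -1.1% respectively.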

If the problem is really the inability to do STLF, then using two micro-ops for both instructions should help, because a load-pair will be allowed to match other regular 64-bit stores in addition to the store micro-ops of store pairs. On the other hand, more micro-ops will enter the pipe. However, the two micro-op solution does not improve performance as much in bh as the single micro-op one, so it appears that having fewer micro-ops is more important than having fewer rescheduledLoads. The only issue is that the single micro-op solution assumes that the SQ can hold 128 bits per entry (which might explain the initial asymmetric implementation).
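The trade-off between the three cracking schemes can be summarised with a small sketch. This is an assumed, simplified model (the function `analyse` and its field names are hypothetical): it assumes a 16-byte pair split into equal, aligned pieces, and that a load micro-op can forward only from one store that covers it entirely.

```python
def analyse(store_uops: int, load_uops: int, pair_bytes: int = 16) -> dict:
    """Summarise one way of cracking store-pair and load-pair (toy model)."""
    store_size = pair_bytes // store_uops  # bytes written per store micro-op
    load_size = pair_bytes // load_uops    # bytes read per load micro-op
    return {
        # Micro-ops entering the pipe for one store-pair plus one load-pair.
        "uops_per_stp_ldp": store_uops + load_uops,
        # Width one SQ entry must hold for a store-pair micro-op.
        "sq_entry_bits": store_size * 8,
        # With equal aligned pieces, a store piece covers a load piece
        # exactly when it is at least as large.
        "ldp_forwards_from_stp": store_size >= load_size,
        # Can a load-pair also forward from two regular 64-bit stores?
        "ldp_forwards_from_64bit_stores": load_size <= 8,
    }


schemes = {
    "current (2 store, 1 load)": (2, 1),
    "one micro-op each": (1, 1),
    "two micro-ops each": (2, 2),
}
for name, (st, ld) in schemes.items():
    print(name, analyse(st, ld))
```

Under these assumptions, the one micro-op scheme uses the fewest micro-ops but needs 128-bit SQ entries and cannot forward from regular 64-bit stores, while the two micro-op scheme forwards in both cases at the cost of more micro-ops.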

So the first question is whether anyone can provide insight into the asymmetric design, and the second is whether it sounds reasonable to go with the one micro-op solution (although, for completeness, any 128-bit store currently cracked into two micro-ops should be reimplemented).


Cheers,

Arthur Perais.


--
Arthur Perais
INRIA Bretagne Atlantique
Bâtiment 12E, Bureau E303, Campus de Beaulieu
35042 Rennes, France

