Hi Arthur,
Apologies for the delay in responding.

Thanks for looking into this.  As to why it is that way, I think it's
mostly an accident of history. If you were to reimplement the store pair
as a single 128-bit store, we would support the patch entering
mainline gem5.  It is reasonable to expect forwarding between these
operations without the rescheduling problem for well-behaved code, so
such a patch could improve fidelity in those cases.


Curtis

-----Original Message-----
From: gem5-users [mailto:[email protected]] On Behalf Of Arthur Perais
Sent: Friday, January 15, 2016 3:59 AM
To: gem5 users mailing list
Subject: [gem5-users] arm, aarch64: asymmetry between load-pair and store-pair 
implementation

Hello fellow gem5 users,

I recently stumbled upon an interesting behavior in a program that uses
a lot of store-pair/load-pair instructions (which, if I understand
correctly, act as the push and pop of AArch64).
What happens is that store-pair (of two 64-bit registers) is implemented
as two 64-bit store micro-ops. In contrast, load-pair is implemented as a
single 128-bit micro-op. As a result, load-pair cannot benefit from
store-to-load forwarding (STLF) from a corresponding store-pair
instruction, because an instruction cannot get its data from two
different (although contiguous in this case) SQ entries. So I ended up
with a lot of "rescheduledLoads". Those loads wait for the store they
partially depend on (generally one of the two store-pair micro-ops) to
write back before issuing again.
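
To illustrate the forwarding restriction described above, here is a
minimal sketch in Python. It is a toy model, not gem5's LSQ code: the
names StoreEntry and lookup, and the three outcome strings, are
hypothetical, and it assumes a load can forward only from a single SQ
entry that fully covers it.

```python
from dataclasses import dataclass


@dataclass
class StoreEntry:
    addr: int  # start byte address of the store
    size: int  # store width in bytes


def covers(store, load_addr, load_size):
    """True if the store supplies every byte the load needs."""
    return (store.addr <= load_addr
            and load_addr + load_size <= store.addr + store.size)


def overlaps(store, load_addr, load_size):
    """True if the store and the load share at least one byte."""
    return (store.addr < load_addr + load_size
            and load_addr < store.addr + store.size)


def lookup(sq, load_addr, load_size):
    """Scan the store queue from youngest to oldest entry.

    Returns "forward" if one entry covers the load, "reschedule" if the
    youngest conflicting entry only partially overlaps it, and "memory"
    if no entry conflicts. (Assumed policy for this toy model.)
    """
    for store in reversed(sq):
        if covers(store, load_addr, load_size):
            return "forward"
        if overlaps(store, load_addr, load_size):
            return "reschedule"
    return "memory"


# Store-pair cracked into two 64-bit micro-ops, as in the current model.
split_sq = [StoreEntry(0x1000, 8), StoreEntry(0x1008, 8)]
# Store-pair as a single 128-bit micro-op.
fused_sq = [StoreEntry(0x1000, 16)]

split_vs_wide_load = lookup(split_sq, 0x1000, 16)
fused_vs_wide_load = lookup(fused_sq, 0x1000, 16)
split_vs_narrow_loads = [lookup(split_sq, 0x1000, 8),
                         lookup(split_sq, 0x1008, 8)]
```

In this model, the 128-bit load against the split stores is rescheduled,
while both symmetric variants (one micro-op each, or two micro-ops each)
can forward.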

While this appears mostly inconsequential (I considered the Olden suite;
I know it is really old, but it finishes in hours so it's nice for
debugging), one of the programs (bh) has a lot of "rescheduledLoads"
(231M for 5.4B committed micro-ops). As a result, I tried modifying how
load/store-pair are cracked: either one micro-op for each, or two
micro-ops for each.  On Olden, the former has a speedup of 1.02 (geomean)
while the latter has 0.989 (geomean) on the arm_detailed CPU. So in
general IPC does not change, but on bh, the former gives a speedup of
11% while the latter gives 2%. I reckon that the difference might grow
on a more aggressive CPU, since some of the performance loss due
to STLF stalls may be hidden by resource contention (ROB/IQ full) in
arm_detailed.
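
For scale, the bh counts quoted above can be turned into a rate. This is
only arithmetic on the two numbers given in this message; the variable
names are made up for illustration.

```python
# Counts reported for bh in this message.
rescheduled_loads = 231e6  # "rescheduledLoads"
committed_uops = 5.4e9     # committed micro-ops

# Fraction of committed micro-ops that correspond to a rescheduled load.
fraction = rescheduled_loads / committed_uops

# Average number of committed micro-ops per rescheduled load.
uops_per_reschedule = committed_uops / rescheduled_loads

print(f"{fraction:.1%} of committed micro-ops")
print(f"about one rescheduled load every {uops_per_reschedule:.1f} micro-ops")
```

That is roughly 4.3%, or about one rescheduled load for every 23
committed micro-ops.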

If the problem is really the inability to do STLF, then using two
micro-ops for both instructions should help, because a load-pair will be
allowed to match other regular 64-bit stores in addition to the store
micro-ops of store-pairs. On the other hand, more micro-ops will enter
the pipe. However, the two micro-op solution does not improve
performance on bh as much as the single micro-op one, so it appears
that having fewer micro-ops is more important than having fewer
rescheduledLoads. The only issue is that the single micro-op solution
assumes that the SQ can hold 128 bits per entry (which might explain the
initial asymmetric implementation).

So the first question is whether anyone can provide insight into the
asymmetric design, and the second is whether it sounds reasonable to
adopt the one micro-op solution (although, for completeness, any 128-bit
store currently cracked into two micro-ops should be reimplemented).

Cheers,

Arthur Perais.


--
Arthur Perais
INRIA Bretagne Atlantique
Bâtiment 12E, Bureau E303, Campus de Beaulieu
35042 Rennes, France

_______________________________________________
gem5-users mailing list
[email protected]
http://m5sim.org/cgi-bin/mailman/listinfo/gem5-users