Hi Arthur,

Apologies for the delay in responding, and thanks for looking into this. As to why it is that way, I think it is mostly an accident of history. If you were to reimplement the store-pair as a single 128-bit store, we would support the patch entering mainline gem5. It is reasonable to expect forwarding between these operations for well-behaved code, without the rescheduling problem, so such a patch could improve fidelity in those cases.
Curtis

-----Original Message-----
From: gem5-users [mailto:[email protected]] On Behalf Of Arthur Perais
Sent: Friday, January 15, 2016 3:59 AM
To: gem5 users mailing list
Subject: [gem5-users] arm, aarch64: asymmetry between load-pair and store-pair implementation

Hello fellow gem5 users,

I recently stumbled onto an interesting behavior in a program that uses a lot of store-pair/load-pair instructions (which, if I understand correctly, act as the push and pop of AArch64). A store-pair (of two 64-bit registers) is implemented as two 64-bit store micro-ops. Conversely, a load-pair is implemented as a single 128-bit micro-op. As a result, a load-pair cannot benefit from store-to-load forwarding from a corresponding store-pair instruction, because an instruction cannot get its data from two different (although, in that case, contiguous) SQ entries. So I ended up with a lot of "rescheduledLoads": those loads wait for the store they partially depend on (generally one of the two store-pair micro-ops) to write back before issuing again.

While this appears mostly inconsequential (I considered the Olden suite; I know it is really old, but it finishes in hours, so it is nice for debugging), one of the programs (bh) has a lot of "rescheduledLoads" (231M for 5.4B committed micro-ops). I therefore tried modifying how load-pair/store-pair are cracked: either one micro-op for each, or two micro-ops for each. On Olden, the former gives a speedup of 1.02 (geomean) while the latter gives 0.989 (geomean) on the arm_detailed CPU. So in general IPC does not change, but on bh, the former gives a speedup of 11% while the latter gives 2%. I reckon the difference might grow on a more aggressive CPU, since some of the performance loss due to STLF stalls may be hidden by resource contention (ROB/IQ full) in arm_detailed.
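To make the forwarding constraint concrete, here is a minimal toy model of the single-entry STLF rule. This is not gem5's actual LSQ code, and all names are invented for illustration; it only shows why a 128-bit load-pair micro-op that overlaps two 64-bit SQ entries gets rescheduled instead of forwarded, even though the two stores are contiguous:

```python
# Toy model of store-to-load forwarding (STLF) in a store queue (SQ).
# NOT gem5 code; illustrative only. Rule modeled: a load can forward
# only if a SINGLE store-queue entry fully covers its bytes.

class StoreQueue:
    def __init__(self):
        self.entries = []  # (addr, size_bytes), oldest first

    def push(self, addr, size):
        self.entries.append((addr, size))

    def can_forward(self, load_addr, load_size):
        # Walk from youngest to oldest store.
        for addr, size in reversed(self.entries):
            fully_covered = (addr <= load_addr and
                             load_addr + load_size <= addr + size)
            overlaps = (load_addr < addr + size and
                        addr < load_addr + load_size)
            if fully_covered:
                return True   # one entry covers the load: forward
            if overlaps:
                return False  # partial overlap: "rescheduledLoad"
        return False          # no overlapping store: read from cache

sq = StoreQueue()
# A store-pair cracked into two 64-bit store micro-ops at 0x100, 0x108:
sq.push(0x100, 8)
sq.push(0x108, 8)

print(sq.can_forward(0x100, 16))  # 128-bit load-pair micro-op -> False
print(sq.can_forward(0x100, 8))   # plain 64-bit load micro-op -> True
```

With the same contiguous data in the SQ, the 64-bit load forwards but the 128-bit load-pair does not, which is exactly the asymmetry described above: cracking both sides the same way (one micro-op each, or two micro-ops each) restores the match.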
If the problem is really the inability to do STLF, then using two micro-ops for both instructions should help, because a load-pair would then be allowed to match regular 64-bit stores in addition to the store micro-ops of store-pairs. On the other hand, more micro-ops enter the pipeline. However, the two-micro-op solution does not improve performance in bh as much as the single-micro-op one, so it appears that having fewer micro-ops matters more than having fewer "rescheduledLoads". The only issue is that the single-micro-op solution assumes the SQ can hold 128 bits per entry (which might explain the initial asymmetric implementation).

So the first question is whether anyone can provide insight into the asymmetric design, and the second is whether the one-micro-op solution sounds reasonable (although, for completeness, any 128-bit store currently cracked into two micro-ops should then be reimplemented as well).

Cheers,
Arthur Perais.

--
Arthur Perais
INRIA Bretagne Atlantique
Bâtiment 12E, Bureau E303, Campus de Beaulieu
35042 Rennes, France

_______________________________________________
gem5-users mailing list
[email protected]
http://m5sim.org/cgi-bin/mailman/listinfo/gem5-users
