> The "r->x" alternative results in "vector" decoding on amdfam10. This is
> AMD-speak for microcoded instructions, and AMD optimization manual strongly
> recommends avoiding them. I have CC'd Ganesh, maybe he can provide more
> relevant data on the performance impact.

Thanks Uros!

Yes, the AMD SWOG recommends precisely what Uros mentions.

<snip from SWOG for BD>
When moving data from a GPR to an XMM register, use separate store and
load instructions to move the data first from the source register to a
temporary location in memory and then from memory into the destination
register.
</snip>

This is listed as an optimization too. It holds for all amdfam10 and
BD family processors.

I have to dig through the performance numbers and will try to get them.

Regards
Ganesh
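
For illustration only, here is a minimal C sketch of the two ways of getting
a 32-bit value from a GPR into an XMM register discussed above. It assumes
an x86 target with SSE2 and a compiler that accepts GCC-style extended asm.
The function names are hypothetical and are not taken from the patch. The
first function uses an intrinsic, for which compilers typically emit a
direct register-to-register movd (the "r->x" form); the second spells out
the store followed by a load that the SWOG text describes. Which one is
faster depends on the target; this sketch makes no performance claim.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Direct GPR -> XMM move. Compilers typically emit "movd %reg, %xmm"
   for this intrinsic (hypothetical helper name). */
static __m128i gpr_to_xmm_direct(int32_t v)
{
    return _mm_cvtsi32_si128(v);
}

/* GPR -> memory -> XMM, written out explicitly with extended asm so the
   compiler cannot fold it back into a direct move (hypothetical helper
   name). The upper 96 bits of the result are zeroed by the movd load. */
static __m128i gpr_to_xmm_via_memory(int32_t v)
{
    int32_t slot;   /* temporary location in memory */
    __m128i out;

    __asm__ ("movl %2, %1\n\t"   /* store the GPR to the stack slot */
             "movd %1, %0"       /* load the slot into the XMM register */
             : "=x" (out), "=m" (slot)
             : "r" (v));
    return out;
}
```

Both functions return the same value; only the instruction sequence used to
get there differs.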