On Tue, 7 Jul 2015, Dutu, Alexandru wrote:

Hi Nilay,

I was asking about the meaning of "upper lane merging" as I was not aware of this behaviour for SSE instructions. Reading through the manual, it seems that the upper 128 bits remain unchanged for the legacy SSE instructions (as Giacomo mentioned). However, some SSE instructions do zero the upper bits: the extended (VEX-encoded) ones.

Having said all this, I don't see why there needs to be a separate read of the old value for any particular write; a masked OR is a lot more efficient. The difference between legacy and extended SSE/AVX instructions, in this respect, becomes just applying a different mask to the old register value before OR-ing in the new value. For example, if the register is 256 bits and we are writing the lower 128 bits with a legacy SSE instruction, the write should compute (128-bit new value, zero-extended) | (mask & 256-bit old value), where the mask has the upper 128 bits set and the lower 128 bits clear. If we are writing with an extended SSE instruction, the mask is simply 0x0, so the result is just the zero-extended new value. Implementing this in hardware might be more efficient than reading the register before every register write.

However, what the Intel implementation actually does is to save the upper lanes of the registers on every transition from extended AVX/SSE to legacy SSE instructions [1]. The VEX prefix should be a good indicator of these transitions. Also, programmers and compilers are encouraged to reduce these transitions for better performance, as that reduces the overhead of saving the upper lanes [2]. At first glance, it seems there are a few instructions which read and write just the upper parts of the vector registers, at least for x86.

Best,
Alex

[1] 
https://software.intel.com/sites/default/files/m/d/4/1/d/8/11MC12_Avoiding_2BAVX-SSE_2BTransition_2BPenalties_2Brh_2Bfinal.pdf
[2] https://www.pgroup.com/lit/articles/insider/v3n2a4.htm



Alex, thanks for those articles. I have read the first one, but not yet the second. I was surprised to read that Intel saves the upper lanes and later restores them.


I was wrong in my previous email when I said that we cannot have out-of-order execution. Actually we can. Consider the following two-instruction sequence:

A = B + C
D = A + E.

Here the operands are vectors. Assume each has two elements. Then it is possible to start the second instruction before the first has completed. This is possible if partial results from prior vector instructions are available for use. The current implementation in gem5 allows for this, but the implementation I have posted does not. My opinion is that, for vector operations, supporting wider operations is of more value than allowing out-of-order execution among narrower operations. Any opinions?

--
Nilay
_______________________________________________
gem5-dev mailing list
gem5-dev@gem5.org
http://m5sim.org/mailman/listinfo/gem5-dev