On Tue, 7 Jul 2015, Dutu, Alexandru wrote:
Hi Nilay,
I was asking about the meaning of "upper lane merging" because I was not
aware of this behaviour for SSE instructions. Reading through the
manual, it seems that the upper 128 bits of the register remain
unchanged for the legacy SSE instructions (as Giacomo was mentioning).
However, some SSE instructions do zero the upper bits: the extended
(VEX-encoded) ones.
Having said all this, I don't see why there needs to be a read of the
old value for any particular write; a masked merge is a lot more
efficient. For this aspect, the difference between legacy and extended
SSE/AVX instructions becomes just applying a different mask to the old
register contents before OR-ing in the new value. For example, if the
register is 256 bits and we are writing the lower 128 bits with a
legacy SSE instruction, the write should be
(zero-extended 128-bit new value) | (upper_mask & 256-bit old value),
where upper_mask has its upper 128 bits set and its lower 128 bits
clear. If we are writing with an extended SSE instruction the mask is
all zeros, so the operation reduces to
(zero-extended 128-bit new value) | (0x0 & 256-bit old value), i.e. the
upper bits are cleared. Implementing this in hardware might be more
efficient than reading the register before every register write.
However, what the Intel implementation actually does is save the
upper lanes of the registers on every transition from AVX/extended SSE
to legacy SSE instructions [1]. The VEX prefix should be a good
indicator of these transitions. Programmers and compilers are also
encouraged to minimize these transitions for better performance, as
that reduces the overhead of saving the upper lanes [2]. At first
glance, it seems there are a few instructions that read and write just
the upper parts of the vector registers, at least on X86.
Best,
Alex
[1]
https://software.intel.com/sites/default/files/m/d/4/1/d/8/11MC12_Avoiding_2BAVX-SSE_2BTransition_2BPenalties_2Brh_2Bfinal.pdf
[2] https://www.pgroup.com/lit/articles/insider/v3n2a4.htm
Alex, thanks for those articles. I read the first one, yet to read the
second. I was surprised to read that Intel saves the upper lanes and
then restores them back.
I was wrong in my previous email when I said that we cannot have
out-of-order execution. Actually we can. Consider the following
two-instruction sequence:
A = B + C
D = A + E.
Here the operands are vectors; assume each has two elements. Then it is
possible to start the second instruction before we have completed the
first. This is possible if partial results from prior vector
instructions are available for use. The current implementation in gem5
allows for this, but the implementation I have posted does not. My
opinion is that, for vector operations, supporting wider operations is
of more value than allowing out-of-order execution among narrower ones.
Any opinions?
--
Nilay
_______________________________________________
gem5-dev mailing list
gem5-dev@gem5.org
http://m5sim.org/mailman/listinfo/gem5-dev