On Tue, 7 Jul 2015, Dutu, Alexandru wrote:
Hi Nilay,
I was asking about the meaning of "upper lane merging" because I was not
aware of this behaviour for SSE instructions. Reading through the
manual, it seems that the upper 128 bits of the register remain
unchanged for the legacy SSE instructions (as Giacomo was mentioning).
However, some SSE instructions do zero the upper bits: the extended
(VEX-encoded) ones.
Having said all this, I don't see why there needs to be a read of the
old value for any particular write; a masked merge is a lot more
efficient. For this aspect, the difference between legacy and extended
SSE/AVX instructions becomes just applying a different mask to the old
register contents before OR-ing in the new value. For example, if the
register is 256 bits and we are writing the lower 128 bits with a
legacy SSE instruction, the write should be
(zero-extended 128-bit new value) | (upper_mask & 256-bit old value),
where upper_mask has its upper 128 bits set and its lower 128 bits
clear. If we are writing with an extended SSE instruction the mask is
all zeros, so the operation reduces to
(zero-extended 128-bit new value) | (0x0 & 256-bit old value), i.e. the
upper bits are cleared. Implementing this in hardware might be more
efficient than reading the register before every register write.
However, what the Intel implementation actually does is save the
upper lanes of the registers on every transition from AVX/extended SSE
to legacy SSE instructions [1]. The VEX prefix should be a good
indicator of these transitions. Programmers and compilers are also
encouraged to minimize these transitions for better performance, as
that reduces the overhead of saving the upper lanes [2]. At first
glance, it seems there are a few instructions that read and write just
the upper parts of the vector registers, at least on X86.
Best,
Alex
[1]
https://software.intel.com/sites/default/files/m/d/4/1/d/8/11MC12_Avoiding_2BAVX-SSE_2BTransition_2BPenalties_2Brh_2Bfinal.pdf
[2] https://www.pgroup.com/lit/articles/insider/v3n2a4.htm
Alex, thanks for those articles. I read the first one, yet to read the
second. I was surprised to read that Intel saves the upper lanes and
then restores them back.
I was wrong in my previous email when I said that we cannot have
out-of-order execution. Actually we can. Consider the following
two-instruction sequence:
A = B + C
D = A + E.
Here the operands are vectors; assume each has two elements. Then it is
possible to start the second instruction before we have completed the
first. This is possible if partial results from prior vector
instructions are available for use. The current implementation in gem5
allows for this, but the implementation I have posted does not. My
opinion is that, for vector operations, supporting wider operations is
of more value than allowing out-of-order execution among narrower ones.
Any opinions?
--
Nilay
_______________________________________________
gem5-dev mailing list
gem5-dev@gem5.org
http://m5sim.org/mailman/listinfo/gem5-dev