On Thu, Nov 03, 2011 at 01:37:09PM +0400, Kirill Yukhin wrote:
> > %ymm0 is all ones (this is code from the auto-vectorization).
> > (2) is not useless, %ymm6 contains the mask, for auto-vectorization
> > (3) is useless, it is there just because the current gather insn patterns
> > always use the previous value of the destination register.
> Sure, I constantly mix up Intel/gcc syntax, sorry for the confusion.
> 
> I've asked the guys who are working on vectorization in ICC.
> It seems we can drop the zeroing of the destination.
> Here is an extract of the answer (from one of the engineers responsible
> for vectorization in ICC):
> 
> >>> I think zero in this situation is just a garbage value,
> >>> and I don’t see why GCC and ICC need to be garbage to garbage
> >>> compatible. If the programmer is using such a fault handler, he/she
> >>> should know the consequences.

Is that just for the _mm{,256}_i{32,64}gather_{ps,pd,epi32,epi64}
intrinsics or also for _mm{,256}_mask_i{32,64}gather_{ps,pd,epi32,epi64}
if the mask has the high bit set in every element?

That is quite important for me.  If it is just for
the former and not for the latter, then we'd have to introduce
new builtins (one set of builtins with a mask and another set
without it, matching the intrinsics).  If it is for both, then
this could be implemented as an expand-time optimization
(if the mask is an SSA_NAME, look at its definition, and if it can be
proven to have the high bit set in every element, just use another set of
instructions).
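To make the reasoning behind that check explicit, here is a rough scalar
sketch of the per-element VPGATHERQQ behaviour as I read the documentation
(the emulate_vpgatherqq helper and the fixed scale of 1 are just for
illustration, not anything in the tree):

#include <stdint.h>

/* Rough scalar model of VPGATHERQQ with scale 1; helper name is made up.  */
static void
emulate_vpgatherqq (int64_t dst[4], const int64_t idx[4],
                    int64_t mask[4], const char *base)
{
  for (int i = 0; i < 4; i++)
    {
      if (mask[i] < 0)   /* High (sign) bit set: gather this element.  */
        dst[i] = *(const int64_t *) (base + idx[i]);
      /* Otherwise dst[i] keeps its previous value.  */
      mask[i] = 0;       /* The mask operand is cleared afterwards.  */
    }
}

If the high bit is set in every mask element, the old contents of dst are
never read on the non-faulting path, so an insn that merely initializes the
destination (the vpxor and the .LC0 load below) ought to be dead; the only
way the old value could still be observed is after a handled fault, which is
exactly the case the quoted answer dismisses.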

So, to formulate my question in a different way, given:

#include <x86intrin.h>

__m256i a, b;

void
f1 (void)
{
  a = _mm256_i64gather_epi64 (NULL, b, 1);
}

void
f2 (void)
{
  __m256i d, e;
  d = _mm256_set_epi64x (1, 2, 3, 4);
  e = _mm256_set_epi64x (-1, -1, -1, -1);
  a = _mm256_mask_i64gather_epi64 (d, NULL, b, e, 1);
}

source, we currently compile this into (-O2 -masm=intel -mavx2,
unrelated insns removed):

f1:
        vpcmpeqd        ymm2, ymm2, ymm2
        vpxor   xmm0, xmm0, xmm0        ! Can this insn be optimized away?
        xor     eax, eax
        vmovdqa ymm1, YMMWORD PTR b[rip]
        vpgatherqq      ymm0, QWORD PTR [rax+ymm1*1], ymm2
        vmovdqa YMMWORD PTR a[rip], ymm0
        vzeroupper
        ret
f2:
        vpcmpeqd        ymm2, ymm2, ymm2
        xor     eax, eax
        vmovdqa ymm0, YMMWORD PTR .LC0[rip]     ! Can this insn be optimized away?
        vmovdqa ymm1, YMMWORD PTR b[rip]
        vpgatherqq      ymm0, QWORD PTR [rax+ymm1*1], ymm2
        vmovdqa YMMWORD PTR a[rip], ymm0
        vzeroupper
        ret
        .align 32
.LC0:
        .quad   4
        .quad   3
        .quad   2
        .quad   1

From the above, I understand that for f1's vpxor insn the answer is
"yes, it can be optimized away"; what about f2's vmovdqa loading ymm0 from .LC0?

        Jakub
