On Thu, Nov 03, 2011 at 01:37:09PM +0400, Kirill Yukhin wrote:
> > %ymm0 is all ones (this is code from the auto-vectorization).
> > (2) is not useless, %ymm6 contains the mask, for auto-vectorization
> > (3) is useless, it is there just because the current gather insn patterns
> > always use the previous value of the destination register.
> Sure, I constantly mix Intel/gcc syntax, sorry for the confusion.
>
> I've asked the guys who are working on vectorization in ICC.
> It seems we may kick off zeroing of the destination.
> Here is an extract of the answer (from one of the engineers responsible
> for vectorization in ICC):
>
> >>> I think zero in this situation is just a garbage value,
> >>> and I don't see why GCC and ICC need to be garbage to garbage
> >>> compatible.  If the programmer is using such a fault handler, he/she
> >>> should know the consequences.
Is that just for the _mm{,256}_i{32,64}gather_{ps,pd,epi32,epi64} intrinsics
or also for _mm{,256}_mask_i{32,64}gather_{ps,pd,epi32,epi64} if the mask
has all high bits set?  That is quite important for me.

If it is just for the former and not for the latter, then we'd have to
introduce new builtins (one set of builtins with mask and another set
without it, matching the intrinsics).  If it is for both, then this could
be implemented as an expand-time optimization (if the mask is an SSA_NAME,
look at its definition, and if it can be proven to have all high bits set,
just use another set of instructions).

So, to formulate my question in a different way, given the source:

#include <x86intrin.h>

__m256i a, b;

void
f1 (void)
{
  a = _mm256_i64gather_epi64 (NULL, b, 1);
}

void
f2 (void)
{
  __m256i d, e;
  d = _mm256_set_epi64x (1, 2, 3, 4);
  e = _mm256_set_epi64x (-1, -1, -1, -1);
  a = _mm256_mask_i64gather_epi64 (d, NULL, b, e, 1);
}

we currently compile this into (-O2 -masm=intel -mavx2, unrelated insns
removed):

f1:
	vpcmpeqd	ymm2, ymm2, ymm2
	vpxor	xmm0, xmm0, xmm0	! Can this insn be optimized away?
	xor	eax, eax
	vmovdqa	ymm1, YMMWORD PTR b[rip]
	vpgatherqq	ymm0, QWORD PTR [rax+ymm1*1], ymm2
	vmovdqa	YMMWORD PTR a[rip], ymm0
	vzeroupper
	ret

f2:
	vpcmpeqd	ymm2, ymm2, ymm2
	xor	eax, eax
	vmovdqa	ymm0, YMMWORD PTR .LC0[rip]	! Can this insn be optimized away?
	vmovdqa	ymm1, YMMWORD PTR b[rip]
	vpgatherqq	ymm0, QWORD PTR [rax+ymm1*1], ymm2
	vmovdqa	YMMWORD PTR a[rip], ymm0
	vzeroupper
	ret

	.align 32
.LC0:
	.quad	4
	.quad	3
	.quad	2
	.quad	1

From the above, I understand that for f1's vpxor insn the answer is "yes,
it can be optimized away"; what about f2's vmovdqa loading ymm0 from .LC0?
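(Illustrative sketch, not from the original mail: the helper names below are
made up; the assumption is that a masked gather whose mask has all high bits
set overwrites every destination element, which is exactly why the old
destination value would be dead and the vpxor/vmovdqa insns above could be
dropped.)

#include <x86intrin.h>

__m256i
gather_masked_all_ones (long long const *p, __m256i idx)
{
  __m256i src = _mm256_setzero_si256 ();   /* old destination, value irrelevant */
  __m256i mask = _mm256_set1_epi64x (-1);  /* all high bits set */
  /* Every mask element selects the gathered value, so src is never kept.  */
  return _mm256_mask_i64gather_epi64 (src, p, idx, mask, 8);
}

__m256i
gather_unmasked (long long const *p, __m256i idx)
{
  /* Under the assumption above, this returns element-wise the same
     result as gather_masked_all_ones.  */
  return _mm256_i64gather_epi64 (p, idx, 8);
}

	Jakub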