https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95866
Bug ID: 95866
Summary: vectorized shift with scalar argument not correctly costed
Product: gcc
Version: 11.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: rguenth at gcc dot gnu.org
Target Milestone: ---
int x[4];
void foo(int i)
{
  i = (i+1) & 31;
  x[0] = (x[0] << i) + i;
  x[1] = (x[1] << i) + i;
  x[2] = (x[2] << i) + i;
  x[3] = (x[3] << i) + i;
}
For this testcase we do not correctly account for the fact that the
scalar computation of (i+1) & 31 is kept around to feed the generated
vectorized shift by a scalar argument. Since we need the vector of
{ i, ... } for the vectorized add anyway, we should ideally use lane
zero of that vector as the shift operand instead of the original
scalar SSA name that we use now. Currently:
_22 = {i_14(D), i_14(D), i_14(D), i_14(D)};
vect_cst__23 = _22;
vect_cst__24 = { 1, 1, 1, 1 };
vect__1.6_25 = vect_cst__23 + vect_cst__24;
vect_cst__26 = { 31, 31, 31, 31 };
vect_i_15.7_27 = vect__1.6_25 & vect_cst__26;
_1 = i_14(D) + 1;
i_15 = _1 & 31;
vect__2.5_21 = MEM <vector(4) int> [(int *)&x];
vect__3.8_28 = vect__2.5_21 << i_15;
vect__4.9_29 = vect__3.8_28 + vect_i_15.7_27;
MEM <vector(4) int> [(int *)&x] = vect__4.9_29;
return;
where arguably we should have done the (i+1) & 31 compute with scalars
and broadcast the result. Assembly:
foo:
.LFB0:
        .cfi_startproc
        leal    1(%rdi), %eax
        movdqa  x(%rip), %xmm1
        movd    %edi, %xmm3
        andl    $31, %eax
        pshufd  $0, %xmm3, %xmm0
        paddd   .LC0(%rip), %xmm0
        movq    %rax, %xmm2
        pand    .LC1(%rip), %xmm0
        pslld   %xmm2, %xmm1
        paddd   %xmm1, %xmm0
        movaps  %xmm0, x(%rip)
        ret
which is really bad. Even with that optimization simulated, by using
'i' as provided by the function argument directly as the shift amount,
the code we generate is not much better.
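Presumably the simulation amounts to a variant like the following (my
guess at the testcase, which is not spelled out above), i.e. the
original testcase with the i = (i+1) & 31 adjustment dropped:

/* Guessed simulated testcase: shift directly by the incoming argument
   instead of by the masked (i+1) & 31 value.  */
int x[4];
void foo(int i)
{
  x[0] = (x[0] << i) + i;
  x[1] = (x[1] << i) + i;
  x[2] = (x[2] << i) + i;
  x[3] = (x[3] << i) + i;
}

For that variant we generate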
foo:
.LFB0:
        .cfi_startproc
        movdqa  x(%rip), %xmm1
        movslq  %edi, %rax
        movd    %edi, %xmm3
        movq    %rax, %xmm2
        pshufd  $0, %xmm3, %xmm0
        pslld   %xmm2, %xmm1
        paddd   %xmm1, %xmm0
        movaps  %xmm0, x(%rip)
        ret
so RTL optimizations do not recover here either. combine sees
(insn 8 3 9 2 (set (reg:DI 89 [ i ])
        (sign_extend:DI (reg/v:SI 86 [ i ]))) "t.c":5:16 147 {*extendsidi2_rex64}
     (nil))
(insn 10 9 11 2 (set (reg:V4SI 90 [ vect__2.6 ])
        (ashift:V4SI (reg:V4SI 91 [ MEM <vector(4) int> [(int *)&x] ])
            (reg:DI 89 [ i ]))) "t.c":5:16 3739 {ashlv4si3}
     (expr_list:REG_DEAD (reg:V4SI 91 [ MEM <vector(4) int> [(int *)&x] ])
        (expr_list:REG_DEAD (reg:DI 89 [ i ])
            (nil))))
(insn 11 10 12 2 (set (reg:V4SI 92)
        (vec_duplicate:V4SI (reg/v:SI 86 [ i ]))) "t.c":5:22 5169 {*vec_dupv4si}
     (expr_list:REG_DEAD (reg/v:SI 86 [ i ])
        (nil)))
(insn 12 11 13 2 (set (reg:V4SI 93 [ vect__3.7 ])
        (plus:V4SI (reg:V4SI 90 [ vect__2.6 ])
            (reg:V4SI 92))) "t.c":5:22 3545 {*addv4si3}
so the opportunity to "back-propagate" the vec_duplicate for the shift
isn't apparent - nor would it ever consider such a thing.
So we probably should try to fix this in the vectorizer. IIRC this support
for non-external/constant scalar shift args is reasonably new (g:49eab32e6e7).
Also, if the target has a vector-vector shift we should probably prefer
that when the shift amount is already available as a vector.
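For illustration, here is a hand-written SSE2 intrinsics sketch (not
something GCC emits today; foo_sketch is just a placeholder name) of
the code shape argued for above: do the (i+1) & 31 compute once with
scalars, broadcast the result once for the add, and feed the same
scalar to the vector shift:

#include <emmintrin.h>

int x[4];

void foo_sketch (int i)
{
  i = (i + 1) & 31;                              /* scalar compute, done once  */
  __m128i v   = _mm_loadu_si128 ((__m128i *) x); /* load x[0..3]               */
  __m128i vi  = _mm_set1_epi32 (i);              /* single broadcast for the + */
  __m128i cnt = _mm_cvtsi32_si128 (i);           /* scalar shift count (pslld) */
  v = _mm_sll_epi32 (v, cnt);                    /* x[n] << i                  */
  v = _mm_add_epi32 (v, vi);                     /* ... + i                    */
  _mm_storeu_si128 ((__m128i *) x, v);
}

With a vector-by-vector shift available (e.g. AVX2 vpsllvd via
_mm_sllv_epi32) the broadcast vi could serve directly as the shift
operand, which matches the vector-vector preference mentioned above.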