https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114987
Jakub Jelinek <jakub at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
CC| |jakub at gcc dot gnu.org
--- Comment #9 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
To me, the code GCC 11 emits looks much worse than what trunk generates. For
typedef float V __attribute__((vector_size (sizeof (float) * 16)));
void
foo (V *x, V *y)
{
  V r = *x, a = *y;
  for (int i = 0; i < 65536; ++i)
    r = r + a;
  *x = r;
}
the loop at -O3 -mavx used to be
.p2align 4,,10
.p2align 3
.L2:
vaddps -56(%rsp), %ymm4, %ymm0
vaddps -24(%rsp), %ymm5, %ymm2
vmovdqa %xmm0, %xmm1
vmovaps %ymm0, -120(%rsp)
vmovdqa %xmm2, %xmm0
vmovdqa -104(%rsp), %xmm3
vmovaps %ymm2, -88(%rsp)
vmovdqa %xmm2, -24(%rsp)
vmovdqa -72(%rsp), %xmm2
vmovdqa %xmm1, -56(%rsp)
vmovdqa %xmm3, -40(%rsp)
vmovdqa %xmm2, -8(%rsp)
subl $1, %eax
jne .L2
in GCC 11 and just
.L2:
vaddps -56(%rsp), %ymm2, %ymm1
vaddps -24(%rsp), %ymm3, %ymm0
vmovdqa %ymm1, -56(%rsp)
vmovdqa %ymm0, -24(%rsp)
subl $1, %eax
jne .L2
on the trunk. That said, ideally it would not touch memory at all.
forwprop4 already manages to hoist the BIT_FIELD_REFs for the y halves out of
the loop:
<bb 2> [local count: 10737416]:
r_5 = *x_4(D);
- a_7 = *y_6(D);
+ _11 = BIT_FIELD_REF <*y_6(D), 256, 256>;
+ _14 = BIT_FIELD_REF <*y_6(D), 256, 0>;
<bb 3> [local count: 1063004408]:
# r_13 = PHI <r_9(3), r_5(2)>
# ivtmp_2 = PHI <ivtmp_1(3), 65536(2)>
- _14 = BIT_FIELD_REF <a_7, 256, 0>;
_15 = BIT_FIELD_REF <r_13, 256, 0>;
_10 = _14 + _15;
- _11 = BIT_FIELD_REF <a_7, 256, 256>;
_12 = BIT_FIELD_REF <r_13, 256, 256>;
_16 = _11 + _12;
r_9 = {_10, _16};
ivtmp_1 = ivtmp_2 + 4294967295;
if (ivtmp_1 != 0)
goto <bb 3>; [98.99%]
else
goto <bb 4>; [1.01%]
but r is a reduction, and nothing after the vector lowering figures out that
it would be beneficial to change it even further, to:
<bb 2> [local count: 10737416]:
r_5 = *x_4(D);
_11 = BIT_FIELD_REF <*y_6(D), 256, 256>;
_14 = BIT_FIELD_REF <*y_6(D), 256, 0>;
_200 = BIT_FIELD_REF <r_5, 256, 0>;
_201 = BIT_FIELD_REF <r_5, 256, 256>;
<bb 3> [local count: 1063004408]:
# _202 = PHI <_10(3), _200(2)>
# _203 = PHI <_16(3), _201(2)>
# ivtmp_2 = PHI <ivtmp_1(3), 65536(2)>
_10 = _14 + _202;
_16 = _11 + _203;
ivtmp_1 = ivtmp_2 + 4294967295;
if (ivtmp_1 != 0)
goto <bb 3>; [98.99%]
else
goto <bb 4>; [1.01%]
<bb 4>:
r_13 = {_202, _203};
(a kind of SRA for the vector halves).
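
For illustration only, the same transformation can be written by hand at the
source level. This is a sketch, not what any GCC pass does: the V8 typedef and
the name foo_split are made up here, and memcpy is used only as a portable way
to read and write the two 256-bit halves of the 512-bit vector. Each half is
its own loop-carried variable, so the full V never has to be rebuilt inside
the loop.

```c
#include <string.h>

typedef float V __attribute__((vector_size (sizeof (float) * 16)));
/* Hypothetical half-width type, not part of the original testcase.  */
typedef float V8 __attribute__((vector_size (sizeof (float) * 8)));

/* The testcase from this comment, unchanged.  */
void
foo (V *x, V *y)
{
  V r = *x, a = *y;
  for (int i = 0; i < 65536; ++i)
    r = r + a;
  *x = r;
}

/* Hand-split variant: the reduction is carried as two independent
   256-bit halves and the halves are stored back only after the loop.  */
void
foo_split (V *x, V *y)
{
  V8 r_lo, r_hi, a_lo, a_hi;

  /* Split both inputs once, before the loop.  */
  memcpy (&r_lo, (const char *) x, sizeof (V8));
  memcpy (&r_hi, (const char *) x + sizeof (V8), sizeof (V8));
  memcpy (&a_lo, (const char *) y, sizeof (V8));
  memcpy (&a_hi, (const char *) y + sizeof (V8), sizeof (V8));

  for (int i = 0; i < 65536; ++i)
    {
      r_lo = r_lo + a_lo;
      r_hi = r_hi + a_hi;
    }

  /* Recombine only at the end.  */
  memcpy ((char *) x, &r_lo, sizeof (V8));
  memcpy ((char *) x + sizeof (V8), &r_hi, sizeof (V8));
}
```

Both functions perform the same per-lane additions in the same order, so they
should produce identical results; whether the split form actually keeps the
accumulators in registers still depends on the compiler version and flags.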