https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114987

Jakub Jelinek <jakub at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |jakub at gcc dot gnu.org

--- Comment #9 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
To me the code GCC 11 emitted looks much worse than what trunk generates.  For
typedef float V __attribute__((vector_size (sizeof (float) * 16)));

void
foo (V *x, V *y)
{
  V r = *x, a = *y;
  for (int i = 0; i < 65536; ++i)
    r = r + a;
  *x = r;
}
-O3 -mavx used to be
        .p2align 4,,10
        .p2align 3
.L2:
        vaddps  -56(%rsp), %ymm4, %ymm0
        vaddps  -24(%rsp), %ymm5, %ymm2
        vmovdqa %xmm0, %xmm1
        vmovaps %ymm0, -120(%rsp)
        vmovdqa %xmm2, %xmm0
        vmovdqa -104(%rsp), %xmm3
        vmovaps %ymm2, -88(%rsp)
        vmovdqa %xmm2, -24(%rsp)
        vmovdqa -72(%rsp), %xmm2
        vmovdqa %xmm1, -56(%rsp)
        vmovdqa %xmm3, -40(%rsp)
        vmovdqa %xmm2, -8(%rsp)
        subl    $1, %eax
        jne     .L2
in GCC 11 and just
.L2:
        vaddps  -56(%rsp), %ymm2, %ymm1
        vaddps  -24(%rsp), %ymm3, %ymm0
        vmovdqa %ymm1, -56(%rsp)
        vmovdqa %ymm0, -24(%rsp)
        subl    $1, %eax
        jne     .L2
on the trunk.  That said, ideally it would not touch the memory at all.
forwprop4 already manages to hoist the BIT_FIELD_REFs for the y halves out of
the loop:
   <bb 2> [local count: 10737416]:
   r_5 = *x_4(D);
-  a_7 = *y_6(D);
+  _11 = BIT_FIELD_REF <*y_6(D), 256, 256>;
+  _14 = BIT_FIELD_REF <*y_6(D), 256, 0>;

   <bb 3> [local count: 1063004408]:
   # r_13 = PHI <r_9(3), r_5(2)>
   # ivtmp_2 = PHI <ivtmp_1(3), 65536(2)>
-  _14 = BIT_FIELD_REF <a_7, 256, 0>;
   _15 = BIT_FIELD_REF <r_13, 256, 0>;
   _10 = _14 + _15;
-  _11 = BIT_FIELD_REF <a_7, 256, 256>;
   _12 = BIT_FIELD_REF <r_13, 256, 256>;
   _16 = _11 + _12;
   r_9 = {_10, _16};
   ivtmp_1 = ivtmp_2 + 4294967295;
   if (ivtmp_1 != 0)
     goto <bb 3>; [98.99%]
   else
     goto <bb 4>; [1.01%]
but r is a reduction, and nothing after the vector lowering figures out that
it would be beneficial to transform it even further, into:
   <bb 2> [local count: 10737416]:
   r_5 = *x_4(D);
   _11 = BIT_FIELD_REF <*y_6(D), 256, 256>;
   _14 = BIT_FIELD_REF <*y_6(D), 256, 0>;
   _200 = BIT_FIELD_REF <r_5, 256, 0>;
   _201 = BIT_FIELD_REF <r_5, 256, 256>;

   <bb 3> [local count: 1063004408]:
   # _202 = PHI <_10(3), _200(2)>
   # _203 = PHI <_16(3), _201(2)>
   # ivtmp_2 = PHI <ivtmp_1(3), 65536(2)>
   _10 = _14 + _202;
   _16 = _11 + _203;
   ivtmp_1 = ivtmp_2 + 4294967295;
   if (ivtmp_1 != 0)
     goto <bb 3>; [98.99%]
   else
     goto <bb 4>; [1.01%]

   <bb 4>:
   r_13 = {_10, _16};
(kind of SRA for vector parts).
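
At the source level, the transformation sketched above corresponds roughly to
splitting the reduction by hand.  The following is only an illustration, not
something GCC does today: foo_split and the V8 typedef are hypothetical names,
and it is an assumption (not verified here) that the two halves then stay in
%ymm registers for the whole loop with -O3 -mavx.

```c
#include <string.h>

typedef float V  __attribute__((vector_size (sizeof (float) * 16)));
/* Hypothetical half-width type: one 256-bit part of V.  */
typedef float V8 __attribute__((vector_size (sizeof (float) * 8)));

/* Hand-split variant of foo: the 512-bit reduction r is carried as two
   independent 256-bit reductions, so no full-width V value is live
   inside the loop.  */
void
foo_split (V *x, V *y)
{
  V8 r_lo, r_hi, a_lo, a_hi;

  /* Split *x and *y into halves once, before the loop.  */
  memcpy (&r_lo, (char *) x, sizeof (V8));
  memcpy (&r_hi, (char *) x + sizeof (V8), sizeof (V8));
  memcpy (&a_lo, (char *) y, sizeof (V8));
  memcpy (&a_hi, (char *) y + sizeof (V8), sizeof (V8));

  for (int i = 0; i < 65536; ++i)
    {
      r_lo = r_lo + a_lo;
      r_hi = r_hi + a_hi;
    }

  /* Reassemble the result only after the loop.  */
  memcpy ((char *) x, &r_lo, sizeof (V8));
  memcpy ((char *) x + sizeof (V8), &r_hi, sizeof (V8));
}
```

The final store uses the values produced by the last iteration, which is what
the original foo writes back through x.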
