https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99971

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Keywords|                            |missed-optimization
           Assignee|unassigned at gcc dot gnu.org      |rguenth at gcc dot gnu.org
     Ever confirmed|0                           |1
   Last reconfirmed|                            |2021-04-09
             Status|UNCONFIRMED                 |ASSIGNED

--- Comment #2 from Richard Biener <rguenth at gcc dot gnu.org> ---
Confirmed.  While we manage to analyze for the "perfect" solution, we fail
because dependence testing doesn't handle one piece, and that throws away half
of the vectorization.  We do see that we'll retain the scalar loads and
computations, but doing two vector loads, a vector add and a vector store
still seems cheaper than doing four scalar stores:

0x1fdb5a0 x_2(D)->a 1 times unaligned_load (misalign -1) costs 12 in body
0x1fdb5a0 y1_3(D)->a 1 times unaligned_load (misalign -1) costs 12 in body
0x1fdb5a0 _13 + _14 1 times vector_stmt costs 4 in body
0x1fdb5a0 _15 1 times unaligned_store (misalign -1) costs 12 in body
0x1fddcb0 _15 1 times scalar_store costs 12 in body
0x1fddcb0 _18 1 times scalar_store costs 12 in body
0x1fddcb0 _21 1 times scalar_store costs 12 in body
0x1fddcb0 _24 1 times scalar_store costs 12 in body
t.C:28:1: note:  Cost model analysis:
  Vector inside of basic block cost: 40
  Vector prologue cost: 0
  Vector epilogue cost: 0
  Scalar cost of basic block: 48
t.C:28:1: note:  Basic block will be vectorized using SLP
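
(That is, the vector body costs 12 + 12 + 4 + 12 = 40 versus 4 * 12 = 48 for
the four scalar stores, so the cost model still considers the partial
vectorization profitable.)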

now, fortunately GCC 11 will improve on this [a bit] and we'll produce

_Z4testR1ARKS_S2_:
.LFB2:
        .cfi_startproc
        movdqu  (%rsi), %xmm0
        movdqu  (%rdi), %xmm1
        paddd   %xmm1, %xmm0
        movups  %xmm0, (%rdi)
        movd    %xmm0, %eax
        subl    (%rdx), %eax
        movl    %eax, (%rdi)
        pextrd  $1, %xmm0, %eax
        subl    4(%rdx), %eax
        movl    %eax, 4(%rdi)
        pextrd  $2, %xmm0, %eax
        subl    8(%rdx), %eax
        movl    %eax, 8(%rdi)
        pextrd  $3, %xmm0, %eax
        subl    12(%rdx), %eax
        movl    %eax, 12(%rdi)
        ret

which does not re-do the scalar loads/adds but instead uses the vector
result.  Still, the same dependence issue is present:

t.C:16:11: missed:   can't determine dependence between y1_3(D)->b and x_2(D)->a
t.C:16:11: note:  removing SLP instance operations starting from: x_2(D)->a = _6;

the scalar code before vectorization looks like

  <bb 2> [local count: 1073741824]:
  _13 = x_2(D)->a;
  _14 = y1_3(D)->a;
  _15 = _13 + _14;
  x_2(D)->a = _15;
  _16 = x_2(D)->b;
  _17 = y1_3(D)->b;  <---
  _18 = _16 + _17;
  x_2(D)->b = _18;
  _19 = x_2(D)->c;
  _20 = y1_3(D)->c;
  _21 = _19 + _20;
  x_2(D)->c = _21;
  _22 = x_2(D)->d;
  _23 = y1_3(D)->d;
  _24 = _22 + _23;
  x_2(D)->d = _24;
  _5 = y2_4(D)->a;
  _6 = _15 - _5;
  x_2(D)->a = _6;  <---
  _7 = y2_4(D)->b;
  _8 = _18 - _7;
  x_2(D)->b = _8;
  _9 = y2_4(D)->c;
  _10 = _21 - _9;
  x_2(D)->c = _10;
  _11 = y2_4(D)->d;
  _12 = _24 - _11;
  x_2(D)->d = _12;
  return;
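
(The two statements marked "<---" above are exactly the y1_3(D)->b read and the
x_2(D)->a store that the dependence test gives up on.)

For reference, the testcase presumably looks something like the following
sketch, reconstructed from the GIMPLE above and the mangled name
_Z4testR1ARKS_S2_; the exact member layout and operator definitions are
assumptions:

struct A { int a, b, c, d; };

A& operator+=(A& x, A const& y)
{
  // member-wise add, corresponding to the first half of the GIMPLE
  x.a += y.a; x.b += y.b; x.c += y.c; x.d += y.d;
  return x;
}

A& operator-=(A& x, A const& y)
{
  // member-wise subtract, corresponding to the second half of the GIMPLE
  x.a -= y.a; x.b -= y.b; x.c -= y.c; x.d -= y.d;
  return x;
}

void test(A& x, A const& y1, A const& y2)
{
    x += y1;
    x -= y2;
}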


Using

void test(A& __restrict x, A const& y1, A const& y2)
{
    x += y1;
    x -= y2;
}

produces optimal assembly even with GCC 10:

_Z4testR1ARKS_S2_:
.LFB2:
        .cfi_startproc
        movdqu  (%rsi), %xmm0
        movdqu  (%rdx), %xmm1
        movdqu  (%rdi), %xmm2
        psubd   %xmm1, %xmm0
        paddd   %xmm2, %xmm0
        movups  %xmm0, (%rdi)
        ret

note that I think we should be able to handle the dependences even without
the __restrict annotation.
