https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83008

            Bug ID: 83008
           Summary: [performance] Is it better to avoid extra instructions
                    in data passing between loops?
           Product: gcc
           Version: 8.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: sergey.shalnov at intel dot com
  Target Milestone: ---

I found strange code generated by GCC-8.0/7.x with following command line
options:
-g -Ofast -march=skylake-avx512 -ftree-vectorize

There are not vectorized two loops. 
First one doesn’t vectorized because:
test.c:6:23: note: cost model: the vector iteration cost = 1488 divided by the
scalar iteration cost = 328 is greater or equal to the vectorization factor =
4.
test.c:6:23: note: not vectorized: vectorization not profitable.
test.c:6:23: note: not vectorized: vector version will never be profitable.

Second one doesn’t vectorized because:
test.c:20:23: note: step unknown.
test.c:20:23: note: reduction: not commutative/associative: sum_87 = (int) _61;
test.c:20:23: note: Unknown def-use cycle pattern.
test.c:20:23: note: Unsupported pattern.
test.c:20:23: note: not vectorized: unsupported use in stmt.
test.c:20:23: note: unexpected pattern.

If we look into asm we found strange method to passing data in “tmp” array
between loops:
…loop 1 body…
 13f:   41 8d 04 13             lea    (%r11,%rdx,1),%eax
 143:   c5 f9 6e d8             vmovd  %eax,%xmm3
 147:   c5 e1 62 db             vpunpckldq %xmm3,%xmm3,%xmm3
 14b:   c5 f9 62 c0             vpunpckldq %xmm0,%xmm0,%xmm0
 14f:   c5 f1 62 c9             vpunpckldq %xmm1,%xmm1,%xmm1
 153:   c5 e9 62 d2             vpunpckldq %xmm2,%xmm2,%xmm2
 157:   c5 e9 6c d2             vpunpcklqdq %xmm2,%xmm2,%xmm2
 15b:   c5 f1 6c c9             vpunpcklqdq %xmm1,%xmm1,%xmm1
 15f:   c5 f9 6c c0             vpunpcklqdq %xmm0,%xmm0,%xmm0
 163:   c5 e1 6c db             vpunpcklqdq %xmm3,%xmm3,%xmm3
 167:   c4 e3 6d 38 c9 01       vinserti128 $0x1,%xmm1,%ymm2,%ymm1
 16d:   c4 e3 7d 38 c3 01       vinserti128 $0x1,%xmm3,%ymm0,%ymm0
 173:   62 f3 f5 48 3a c0 01    vinserti64x4 $0x1,%ymm0,%zmm1,%zmm0
 17a:   62 f1 fd 48 7f 44 24    vmovdqa64 %zmm0,-0x40(%rsp)
 181:   ff 
 182:   8b 54 24 e0             mov    -0x20(%rsp),%edx
 186:   03 54 24 f0             add    -0x10(%rsp),%edx
…loop 2 body…

if I'm not mistaken the algorithm looks like following:
1.      Do first loop and keep values in GPR
2.      Move these GPRs to XMMs
3.      Pack these XMMs into YMMs
4.      Pack these YMMs to ZMM
5.      Spill ZMM into stack
6.      Get values from stack to GPRs of the second loop

It might be better, from performance perspective, to pass values from first
loop directly to the second loop with GPRs (without all these vector
registers)?

The reproducer is:
     1  int test(unsigned char * input1, unsigned char * input2)
     2  {
     3      unsigned int tmp[4][4];
     4      unsigned int var0, var1, var2, var3;
     5      int sum = 0;
     6      for (int i = 0; i < 4; i++, input1 += 4, input2 += 4) {
     7          var0 = (input1[0] + input2[0]) + (input1[4] + input2[4]);
     8          var1 = (input1[1] + input2[1]) + (input1[5] + input2[5]);
     9          var2 = (input1[2] + input2[2]) + (input1[6] + input2[6]);
    10          var3 = (input1[3] + input2[3]) + (input1[7] + input2[7]);
    11          int inter0 = var0 + var1;
    12          int inter1 = var0 + var1;
    13          int inter2 = var2 + var3;
    14          int inter3 = var2 + var3;
    15          tmp[i][0] = inter0 + inter2;
    16          tmp[i][2] = inter0 + inter2;
    17          tmp[i][1] = inter1 + inter3;
    18          tmp[i][3] = inter1 + inter3;
    19      }
    20      for (int i = 0; i < 4; i++) {
    21          int inter0 = tmp[0][i] + tmp[1][i];
    22          int inter1 = tmp[0][i] + tmp[1][i];
    23          int inter2 = tmp[2][i] + tmp[3][i];
    24          int inter3 = tmp[2][i] + tmp[3][i];
    25          var0 = inter0 + inter2;
    26          var2 = inter0 + inter2;
    27          var1 = inter1 + inter3;
    28          var3 = inter1 + inter3;
    29          sum += var0 + var1 + var2 + var3;
    30      }
    31
    32      return sum;
    33  }

Sergey

Reply via email to