https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83008
Bug ID: 83008
Summary: [performance] Is it better to avoid extra instructions in data passing between loops?
Product: gcc
Version: 8.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: sergey.shalnov at intel dot com
Target Milestone: ---

I found strange code generated by GCC 8.0/7.x with the following command-line options:

-g -Ofast -march=skylake-avx512 -ftree-vectorize

There are two loops that are not vectorized. The first is not vectorized because:

test.c:6:23: note: cost model: the vector iteration cost = 1488 divided by the scalar iteration cost = 328 is greater or equal to the vectorization factor = 4.
test.c:6:23: note: not vectorized: vectorization not profitable.
test.c:6:23: note: not vectorized: vector version will never be profitable.

The second is not vectorized because:

test.c:20:23: note: step unknown.
test.c:20:23: note: reduction: not commutative/associative: sum_87 = (int) _61;
test.c:20:23: note: Unknown def-use cycle pattern.
test.c:20:23: note: Unsupported pattern.
test.c:20:23: note: not vectorized: unsupported use in stmt.
test.c:20:23: note: unexpected pattern.
If we look at the asm, we find a strange way of passing data between the loops through the "tmp" array:

…loop 1 body…
 13f: 41 8d 04 13             lea    (%r11,%rdx,1),%eax
 143: c5 f9 6e d8             vmovd  %eax,%xmm3
 147: c5 e1 62 db             vpunpckldq %xmm3,%xmm3,%xmm3
 14b: c5 f9 62 c0             vpunpckldq %xmm0,%xmm0,%xmm0
 14f: c5 f1 62 c9             vpunpckldq %xmm1,%xmm1,%xmm1
 153: c5 e9 62 d2             vpunpckldq %xmm2,%xmm2,%xmm2
 157: c5 e9 6c d2             vpunpcklqdq %xmm2,%xmm2,%xmm2
 15b: c5 f1 6c c9             vpunpcklqdq %xmm1,%xmm1,%xmm1
 15f: c5 f9 6c c0             vpunpcklqdq %xmm0,%xmm0,%xmm0
 163: c5 e1 6c db             vpunpcklqdq %xmm3,%xmm3,%xmm3
 167: c4 e3 6d 38 c9 01       vinserti128 $0x1,%xmm1,%ymm2,%ymm1
 16d: c4 e3 7d 38 c3 01       vinserti128 $0x1,%xmm3,%ymm0,%ymm0
 173: 62 f3 f5 48 3a c0 01    vinserti64x4 $0x1,%ymm0,%zmm1,%zmm0
 17a: 62 f1 fd 48 7f 44 24    vmovdqa64 %zmm0,-0x40(%rsp)
 181: ff
 182: 8b 54 24 e0             mov    -0x20(%rsp),%edx
 186: 03 54 24 f0             add    -0x10(%rsp),%edx
…loop 2 body…

If I'm not mistaken, the generated code does the following:
1. Run the first loop and keep its values in GPRs
2. Move these GPRs to XMM registers
3. Pack these XMMs into YMMs
4. Pack these YMMs into a ZMM
5. Spill the ZMM to the stack
6. Load the values from the stack back into GPRs for the second loop

Wouldn't it be better, from a performance perspective, to pass the values from the first loop directly to the second loop in GPRs, without the round trip through all these vector registers?
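For reference, here is a sketch (mine, not from the report) of what keeping the data in scalars could look like. It relies on the fact that in the reproducer below inter0 == inter1 and inter2 == inter3, so all four entries of a tmp row hold the same value, and every column of the second loop reads the same row values; the names row_val and test_gpr are hypothetical:

```c
/* Hypothetical scalar rewrite of the reproducer: the per-row value that
   the original stores into tmp[i][0..3] (all four entries are equal,
   because inter0 == inter1 and inter2 == inter3). */
static unsigned int row_val(const unsigned char *p1, const unsigned char *p2)
{
    unsigned int var0 = (p1[0] + p2[0]) + (p1[4] + p2[4]);
    unsigned int var1 = (p1[1] + p2[1]) + (p1[5] + p2[5]);
    unsigned int var2 = (p1[2] + p2[2]) + (p1[6] + p2[6]);
    unsigned int var3 = (p1[3] + p2[3]) + (p1[7] + p2[7]);
    return (var0 + var1) + (var2 + var3);
}

int test_gpr(const unsigned char *input1, const unsigned char *input2)
{
    /* First "loop", unrolled: one scalar per row of tmp, so the values
       can stay in GPRs instead of being packed into a ZMM and spilled. */
    unsigned int r0 = row_val(input1 + 0,  input2 + 0);
    unsigned int r1 = row_val(input1 + 4,  input2 + 4);
    unsigned int r2 = row_val(input1 + 8,  input2 + 8);
    unsigned int r3 = row_val(input1 + 12, input2 + 12);
    /* Second loop: every column sums the same row values, and var0..var3
       again coincide, so the reduction is 16 * (r0 + r1 + r2 + r3). */
    return 16 * (int)(r0 + r1 + r2 + r3);
}
```

This is only meant to illustrate the scalar data flow the report asks about; it folds the reproducer's duplicated intermediates rather than reproducing them literally.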
The reproducer is:

 1 int test(unsigned char * input1, unsigned char * input2)
 2 {
 3     unsigned int tmp[4][4];
 4     unsigned int var0, var1, var2, var3;
 5     int sum = 0;
 6     for (int i = 0; i < 4; i++, input1 += 4, input2 += 4) {
 7         var0 = (input1[0] + input2[0]) + (input1[4] + input2[4]);
 8         var1 = (input1[1] + input2[1]) + (input1[5] + input2[5]);
 9         var2 = (input1[2] + input2[2]) + (input1[6] + input2[6]);
10         var3 = (input1[3] + input2[3]) + (input1[7] + input2[7]);
11         int inter0 = var0 + var1;
12         int inter1 = var0 + var1;
13         int inter2 = var2 + var3;
14         int inter3 = var2 + var3;
15         tmp[i][0] = inter0 + inter2;
16         tmp[i][2] = inter0 + inter2;
17         tmp[i][1] = inter1 + inter3;
18         tmp[i][3] = inter1 + inter3;
19     }
20     for (int i = 0; i < 4; i++) {
21         int inter0 = tmp[0][i] + tmp[1][i];
22         int inter1 = tmp[0][i] + tmp[1][i];
23         int inter2 = tmp[2][i] + tmp[3][i];
24         int inter3 = tmp[2][i] + tmp[3][i];
25         var0 = inter0 + inter2;
26         var2 = inter0 + inter2;
27         var1 = inter1 + inter3;
28         var3 = inter1 + inter3;
29         sum += var0 + var1 + var2 + var3;
30     }
31
32     return sum;
33 }

Sergey