https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86174

            Bug ID: 86174
           Summary: Poor vectorization/register allocation with omp simd,
                    FMA
           Product: gcc
           Version: 8.1.1
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c
          Assignee: unassigned at gcc dot gnu.org
          Reporter: jed at 59A2 dot org
  Target Milestone: ---

Created attachment 44287
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=44287&action=edit
Source demonstrating poor optimization

The attached code produces lots of needless spills with gcc-8.1 on Linux.

$ gcc -O3 -Wall -march=skylake -ffast-math -fopenmp -c mm-gcc.c

0000000000000080 <mult+0x80> vbroadcastsd ymm2,QWORD PTR [rdx]
0000000000000085 <mult+0x85> vmovupd ymm1,YMMWORD PTR [rcx]
0000000000000089 <mult+0x89> vmovapd ymm0,ymm1
000000000000008d <mult+0x8d> vfmadd213pd ymm0,ymm2,YMMWORD PTR [rsp]
0000000000000093 <mult+0x93> vmovapd YMMWORD PTR [rsp],ymm0
0000000000000098 <mult+0x98> vmovupd ymm0,YMMWORD PTR [rcx+0x20]
000000000000009d <mult+0x9d> vfmadd213pd ymm2,ymm0,YMMWORD PTR [rsp+0x20]
00000000000000a4 <mult+0xa4> vmovapd YMMWORD PTR [rsp+0x20],ymm2
00000000000000aa <mult+0xaa> vbroadcastsd ymm2,QWORD PTR [rdx+0x400]
00000000000000b3 <mult+0xb3> vmovapd ymm3,ymm1
00000000000000b7 <mult+0xb7> vfmadd213pd ymm3,ymm2,YMMWORD PTR [rsp+0x40]
00000000000000be <mult+0xbe> vmovapd YMMWORD PTR [rsp+0x40],ymm3
00000000000000c4 <mult+0xc4> vfmadd213pd ymm2,ymm0,YMMWORD PTR [rsp+0x60]
00000000000000cb <mult+0xcb> vmovapd YMMWORD PTR [rsp+0x60],ymm2
00000000000000d1 <mult+0xd1> vbroadcastsd ymm2,QWORD PTR [rdx+0x800]
00000000000000da <mult+0xda> vmovapd ymm3,ymm1
00000000000000de <mult+0xde> vfmadd213pd ymm3,ymm2,YMMWORD PTR [rsp+0x80]
00000000000000e8 <mult+0xe8> vmovapd YMMWORD PTR [rsp+0x80],ymm3
00000000000000f1 <mult+0xf1> vfmadd213pd ymm2,ymm0,YMMWORD PTR [rsp+0xa0]
00000000000000fb <mult+0xfb> vmovapd YMMWORD PTR [rsp+0xa0],ymm2
0000000000000104 <mult+0x104> vbroadcastsd ymm2,QWORD PTR [rdx+0xc00]
000000000000010d <mult+0x10d> vfmadd213pd ymm1,ymm2,YMMWORD PTR [rsp+0xc0]
0000000000000117 <mult+0x117> vmovapd YMMWORD PTR [rsp+0xc0],ymm1
0000000000000120 <mult+0x120> vfmadd213pd ymm0,ymm2,YMMWORD PTR [rsp+0xe0]
000000000000012a <mult+0x12a> vmovapd YMMWORD PTR [rsp+0xe0],ymm0
0000000000000133 <mult+0x133> add    rdx,0x8
0000000000000137 <mult+0x137> add    rcx,0x400
000000000000013e <mult+0x13e> cmp    rsi,rcx
0000000000000141 <mult+0x141> jne    0000000000000080 <mult+0x80>

GCC does not issue vector instructions if omp simd is removed.  In contrast,
clang-6 vectorizes well with or without omp simd:

$ clang -O3 -Wall -march=haswell -ffast-math -c mm-gcc.c

00000000000000e0 <mult+0xe0> vmovapd ymm9,ymm6
00000000000000e4 <mult+0xe4> vbroadcastsd ymm10,QWORD PTR [rdi+rbx*8-0x800]
00000000000000ee <mult+0xee> vmovupd ymm6,YMMWORD PTR [rax-0x20]
00000000000000f3 <mult+0xf3> vmovupd ymm11,YMMWORD PTR [rax]
00000000000000f7 <mult+0xf7> vfmadd231pd ymm1,ymm6,ymm10
00000000000000fc <mult+0xfc> vfmadd231pd ymm7,ymm11,ymm10
0000000000000101 <mult+0x101> vbroadcastsd ymm10,QWORD PTR [rdi+rbx*8-0x400]
000000000000010b <mult+0x10b> vfmadd231pd ymm8,ymm6,ymm10
0000000000000110 <mult+0x110> vfmadd231pd ymm5,ymm11,ymm10
0000000000000115 <mult+0x115> vbroadcastsd ymm10,QWORD PTR [rdi+rbx*8]
000000000000011b <mult+0x11b> vfmadd231pd ymm2,ymm6,ymm10
0000000000000120 <mult+0x120> vfmadd231pd ymm3,ymm11,ymm10
0000000000000125 <mult+0x125> vbroadcastsd ymm10,QWORD PTR [rdi+rbx*8+0x400]
000000000000012f <mult+0x12f> vfmadd213pd ymm6,ymm10,ymm9
0000000000000134 <mult+0x134> vfmadd231pd ymm4,ymm11,ymm10
0000000000000139 <mult+0x139> add    rax,0x400
000000000000013f <mult+0x13f> add    rbx,0x1
0000000000000143 <mult+0x143> jne    00000000000000e0 <mult+0xe0>

(I used -march=haswell instead of -march=skylake due to
https://bugs.llvm.org/show_bug.cgi?id=37819.)

Reply via email to