https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88963

            Bug ID: 88963
           Summary: gcc generates terrible code for vectors of 64+ length
                    which are not natively supported
           Product: gcc
           Version: 9.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c
          Assignee: unassigned at gcc dot gnu.org
          Reporter: bugzi...@poradnik-webmastera.com
  Target Milestone: ---

[code]
typedef int VInt __attribute__((vector_size(64)));

void test(VInt*__restrict a, VInt*__restrict b, 
    VInt*__restrict c)
{
    *a = *b + *c;
}
[/code]

This code is compiled with -O3 -march=skylake in the following way:

[asm]
test(int __vector(16)*, int __vector(16)*, int __vector(16)*):
  push rbp
  mov rbp, rsp
  and rsp, -64
  sub rsp, 136
  vmovdqa xmm3, XMMWORD PTR [rsi]
  vmovdqa xmm4, XMMWORD PTR [rsi+16]
  vmovdqa xmm5, XMMWORD PTR [rsi+32]
  vmovdqa xmm6, XMMWORD PTR [rsi+48]
  vmovdqa xmm7, XMMWORD PTR [rdx]
  vmovaps XMMWORD PTR [rsp-56], xmm3
  vmovdqa xmm1, XMMWORD PTR [rdx+16]
  vmovaps XMMWORD PTR [rsp-40], xmm4
  vmovdqa ymm4, YMMWORD PTR [rsp-56]
  vmovdqa xmm2, XMMWORD PTR [rdx+32]
  vmovaps XMMWORD PTR [rsp-8], xmm6
  vmovaps XMMWORD PTR [rsp+8], xmm7
  vmovdqa xmm3, XMMWORD PTR [rdx+48]
  vmovaps XMMWORD PTR [rsp-24], xmm5
  vmovaps XMMWORD PTR [rsp+24], xmm1
  vpaddd ymm0, ymm4, YMMWORD PTR [rsp+8]
  vmovdqa ymm5, YMMWORD PTR [rsp-24]
  vmovaps XMMWORD PTR [rsp+40], xmm2
  vmovaps XMMWORD PTR [rsp+56], xmm3
  vmovdqa xmm2, xmm0
  vmovdqa YMMWORD PTR [rsp-120], ymm0
  vpaddd ymm0, ymm5, YMMWORD PTR [rsp+40]
  vmovdqa xmm6, XMMWORD PTR [rsp-104]
  vmovdqa YMMWORD PTR [rsp-88], ymm0
  vmovdqa xmm7, XMMWORD PTR [rsp-72]
  vmovaps XMMWORD PTR [rdi], xmm2
  vmovaps XMMWORD PTR [rdi+16], xmm6
  vmovaps XMMWORD PTR [rdi+32], xmm0
  vmovaps XMMWORD PTR [rdi+48], xmm7
  vzeroupper
  leave
  ret
[/asm]

Other compilers (clang, icc) produce nice code. This is from clang:

[asm]
test(int __vector(16)*, int __vector(16)*, int __vector(16)*): # @test(int __vector(16)*, int __vector(16)*, int __vector(16)*)
  vmovdqa ymm0, ymmword ptr [rdx]
  vmovdqa ymm1, ymmword ptr [rdx + 32]
  vpaddd ymm0, ymm0, ymmword ptr [rsi]
  vpaddd ymm1, ymm1, ymmword ptr [rsi + 32]
  vmovdqa ymmword ptr [rdi + 32], ymm1
  vmovdqa ymmword ptr [rdi], ymm0
  vzeroupper
  ret
[/asm]

gcc produces pretty code for -O3 -march=skylake-avx512. It also produces pretty
code for vector size 32 with AVX disabled. However, for vector size 128 and -O3
-march=skylake-avx512 the code is ugly again.
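
For anyone who wants to check the other two cases mentioned above, here is a minimal sketch of the same test with vector sizes 32 and 128. The names VInt8, VInt32, test32 and test128 are made up for this example and are not part of the original test case; only the 64-byte variant was shown above.

```c
/* Hypothetical variants of the original test case. The type and
   function names are illustrative only. */

/* 8 ints = 32 bytes: twice the native width when AVX is disabled. */
typedef int VInt8 __attribute__((vector_size(32)));

/* 32 ints = 128 bytes: twice the native width with AVX-512. */
typedef int VInt32 __attribute__((vector_size(128)));

/* Same body as the 64-byte test: one element-wise addition. */
void test32(VInt8 *__restrict a, VInt8 *__restrict b,
    VInt8 *__restrict c)
{
    *a = *b + *c;
}

void test128(VInt32 *__restrict a, VInt32 *__restrict b,
    VInt32 *__restrict c)
{
    *a = *b + *c;
}
```

Compiling this with -O3 -S and the relevant options (-mno-avx for test32, -march=skylake-avx512 for test128) should show whether each function goes through the stack or not. The exact output may differ between gcc versions.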
