https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88963
Bug ID: 88963
Summary: gcc generates terrible code for vectors of 64+ length which are not natively supported
Product: gcc
Version: 9.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: c
Assignee: unassigned at gcc dot gnu.org
Reporter: bugzi...@poradnik-webmastera.com
Target Milestone: ---

[code]
typedef int VInt __attribute__((vector_size(64)));

void test(VInt*__restrict a, VInt*__restrict b, VInt*__restrict c)
{
    *a = *b + *c;
}
[/code]

With -O3 -march=skylake, gcc compiles this code as follows:

[asm]
test(int __vector(16)*, int __vector(16)*, int __vector(16)*):
        push    rbp
        mov     rbp, rsp
        and     rsp, -64
        sub     rsp, 136
        vmovdqa xmm3, XMMWORD PTR [rsi]
        vmovdqa xmm4, XMMWORD PTR [rsi+16]
        vmovdqa xmm5, XMMWORD PTR [rsi+32]
        vmovdqa xmm6, XMMWORD PTR [rsi+48]
        vmovdqa xmm7, XMMWORD PTR [rdx]
        vmovaps XMMWORD PTR [rsp-56], xmm3
        vmovdqa xmm1, XMMWORD PTR [rdx+16]
        vmovaps XMMWORD PTR [rsp-40], xmm4
        vmovdqa ymm4, YMMWORD PTR [rsp-56]
        vmovdqa xmm2, XMMWORD PTR [rdx+32]
        vmovaps XMMWORD PTR [rsp-8], xmm6
        vmovaps XMMWORD PTR [rsp+8], xmm7
        vmovdqa xmm3, XMMWORD PTR [rdx+48]
        vmovaps XMMWORD PTR [rsp-24], xmm5
        vmovaps XMMWORD PTR [rsp+24], xmm1
        vpaddd  ymm0, ymm4, YMMWORD PTR [rsp+8]
        vmovdqa ymm5, YMMWORD PTR [rsp-24]
        vmovaps XMMWORD PTR [rsp+40], xmm2
        vmovaps XMMWORD PTR [rsp+56], xmm3
        vmovdqa xmm2, xmm0
        vmovdqa YMMWORD PTR [rsp-120], ymm0
        vpaddd  ymm0, ymm5, YMMWORD PTR [rsp+40]
        vmovdqa xmm6, XMMWORD PTR [rsp-104]
        vmovdqa YMMWORD PTR [rsp-88], ymm0
        vmovdqa xmm7, XMMWORD PTR [rsp-72]
        vmovaps XMMWORD PTR [rdi], xmm2
        vmovaps XMMWORD PTR [rdi+16], xmm6
        vmovaps XMMWORD PTR [rdi+32], xmm0
        vmovaps XMMWORD PTR [rdi+48], xmm7
        vzeroupper
        leave
        ret
[/asm]

Other compilers (clang, icc) produce nice code.
This is from clang:

[asm]
test(int __vector(16)*, int __vector(16)*, int __vector(16)*): # @test(int __vector(16)*, int __vector(16)*, int __vector(16)*)
        vmovdqa ymm0, ymmword ptr [rdx]
        vmovdqa ymm1, ymmword ptr [rdx + 32]
        vpaddd  ymm0, ymm0, ymmword ptr [rsi]
        vpaddd  ymm1, ymm1, ymmword ptr [rsi + 32]
        vmovdqa ymmword ptr [rdi + 32], ymm1
        vmovdqa ymmword ptr [rdi], ymm0
        vzeroupper
        ret
[/asm]

gcc produces good code with -O3 -march=skylake-avx512, where 64-byte vectors are natively supported. It also produces good code for vector size 32 with AVX disabled. However, for vector size 128 with -O3 -march=skylake-avx512 the generated code is ugly again.
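
For anyone who wants to reproduce the other two cases mentioned above, the variants might look like the sketch below. The names VInt32, VInt128, test32 and test128 are made up here for illustration and do not come from the original test case; only the vector sizes and compiler options are taken from the report.

```c
/* Variants of the original test case for the other vector sizes
   mentioned in the report. All type and function names in this
   block are hypothetical; only the sizes come from the report. */

/* 32-byte vector (8 ints): reported to compile well with AVX disabled. */
typedef int VInt32 __attribute__((vector_size(32)));

/* 128-byte vector (32 ints): reported to compile badly even with
   -O3 -march=skylake-avx512. */
typedef int VInt128 __attribute__((vector_size(128)));

void test32(VInt32*__restrict a, VInt32*__restrict b, VInt32*__restrict c)
{
    /* Element-wise addition, same shape as the original test(). */
    *a = *b + *c;
}

void test128(VInt128*__restrict a, VInt128*__restrict b, VInt128*__restrict c)
{
    /* Element-wise addition on a vector twice as wide as a zmm register. */
    *a = *b + *c;
}
```

Assuming the file is saved as variants.c (a placeholder name), the assembly can be inspected with something like "gcc -O3 -march=skylake-avx512 -S variants.c" for test128, and with -O3 and AVX left disabled for test32.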