https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102575

            Bug ID: 102575
           Summary: Failure to optimize double _Complex stores to use
                    largest loads/stores possible
           Product: gcc
           Version: 12.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: gabravier at gmail dot com
  Target Milestone: ---

void test(double _Complex *a)
{
    a[0] = 1;
    a[1] = 1;
}

With -O3 on AMD64 (x86-64), GCC outputs this:

test(double _Complex*):
        movsd   xmm1, QWORD PTR .LC0[rip]
        movsd   xmm0, QWORD PTR .LC0[rip+8]
        movsd   QWORD PTR [rdi], xmm1
        movsd   QWORD PTR [rdi+8], xmm0
        movsd   QWORD PTR [rdi+16], xmm1
        movsd   QWORD PTR [rdi+24], xmm0
        ret

Clang instead outputs this:

test(double _Complex*):
        movsd   xmm0, qword ptr [rip + .LCPI0_0] # xmm0 = mem[0],zero
        movups  xmmword ptr [rdi], xmm0
        movups  xmmword ptr [rdi + 16], xmm0
        ret

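It seems to me like the second output should always be faster: it needs one
load and two 16-byte stores, where GCC's output needs two loads and four
8-byte stores.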

PS: The difference is even larger with `-mavx2`.
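
For illustration, here is a rough hand-written equivalent of the Clang output, using SSE2 intrinsics. This is only a sketch and not part of either compiler's output. The function name test_wide is made up for this example, and the code relies on double _Complex having the same layout as double[2] (real part first), which the C standard guarantees. The #else branch just falls back to the original code on targets without SSE2.

```c
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical helper (not from the report): same effect as test(),
   but written with explicit 16-byte stores where SSE2 is available. */
void test_wide(double _Complex *a)
{
#if defined(__SSE2__)
    /* Low lane = 1.0 (real part), high lane = 0.0 (imaginary part). */
    __m128d one = _mm_set_sd(1.0);

    /* One unaligned 16-byte store per complex element. */
    _mm_storeu_pd((double *)&a[0], one);
    _mm_storeu_pd((double *)&a[1], one);
#else
    /* Fallback: the plain stores from the original test case. */
    a[0] = 1;
    a[1] = 1;
#endif
}
```

With AVX enabled, the same idea could presumably be extended to a single 32-byte store covering both elements, but that is not shown here.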
