https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102575

Bug ID: 102575
Summary: Failure to optimize double _Complex stores to use largest loads/stores possible
Product: gcc
Version: 12.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: gabravier at gmail dot com
Target Milestone: ---

void test(double _Complex *a)
{
    a[0] = 1;
    a[1] = 1;
}

With -O3, on AMD64 GCC outputs this:

test(double _Complex*):
        movsd   xmm1, QWORD PTR .LC0[rip]
        movsd   xmm0, QWORD PTR .LC0[rip+8]
        movsd   QWORD PTR [rdi], xmm1
        movsd   QWORD PTR [rdi+8], xmm0
        movsd   QWORD PTR [rdi+16], xmm1
        movsd   QWORD PTR [rdi+24], xmm0
        ret

Clang instead outputs this:

test(double _Complex*):
        movsd   xmm0, qword ptr [rip + .LCPI0_0] # xmm0 = mem[0],zero
        movups  xmmword ptr [rdi], xmm0
        movups  xmmword ptr [rdi + 16], xmm0
        ret

It seems to me like the second output should always be faster.

PS: The difference is even larger with `-mavx2`.
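
For reference, here is a minimal sketch of why the wider stores are legal. It relies on the C99 guarantee that a complex type has the same representation as an array of two elements of the corresponding real type (real part first). The helper name layout_matches is hypothetical and not part of the original report; it only checks that test() fills 32 contiguous bytes with the doubles {1.0, 0.0, 1.0, 0.0}.

```c
#include <assert.h>
#include <string.h>

/* The function from the report. */
void test(double _Complex *a)
{
    a[0] = 1;
    a[1] = 1;
}

/* Hypothetical helper: returns 1 if test() wrote exactly the four
   doubles {1.0, 0.0, 1.0, 0.0} into the two complex elements. */
int layout_matches(void)
{
    double _Complex a[2];
    const double expected[4] = { 1.0, 0.0, 1.0, 0.0 };

    /* Each complex element occupies two doubles (16 bytes on AMD64). */
    assert(sizeof a == sizeof expected);

    /* Poison the buffer so stale bytes cannot make the check pass. */
    memset(a, 0xff, sizeof a);
    test(a);

    return memcmp(a, expected, sizeof expected) == 0;
}
```

Since each element is the 16-byte pair (1.0, 0.0), a single 16-byte register holding that pair can be stored twice, which matches what the Clang output above does.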