https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80372
Marc Glisse <glisse at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Keywords|                            |missed-optimization
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2017-04-09
          Component|c++                         |middle-end
     Ever confirmed|0                           |1

--- Comment #4 from Marc Glisse <glisse at gcc dot gnu.org> ---
(using -march=skylake-avx512 which sounds recent enough)

  MEM[(struct complexD.42555 *)res_1(D) + 16B] = MEM[(const struct complexD.42555 &)res_1(D)];

gcc often has trouble optimizing direct mem-to-mem assignments. If I write the
code as:

  res[1].real(res[0].real());
  res[1].imag(res[0].imag());

we have

  _3 = REALPART_EXPR <MEM[(const struct complex *)res_1(D)]._M_value>;
  REALPART_EXPR <MEM[(struct complex *)res_1(D) + 16B]._M_value> = _3;
  _4 = IMAGPART_EXPR <MEM[(const struct complex *)res_1(D)]._M_value>;
  IMAGPART_EXPR <MEM[(struct complex *)res_1(D) + 16B]._M_value> = _4;

which we vectorize (SLP)

  vect__3.9_8 = MEM[(doubleD.39 *)res_1(D)];
  MEM[(doubleD.39 *)res_1(D) + 16B] = vect__3.9_8;

and generate

        vmovupd (%rdi), %xmm0
        vmovups %xmm0, 16(%rdi)

If I use memcpy(res+1,res,sizeof(*res)), we get:

  __int128 unsigned _3;
  _3 = MEM[(char * {ref-all})res_1(D)];
  MEM[(char * {ref-all})res_1(D) + 16B] = _3;

        vmovdqu64       (%rdi), %xmm0
        vmovups %xmm0, 16(%rdi)
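
For anyone wanting to reproduce the comparison, the three source forms discussed in this comment can be put side by side as a minimal sketch. The function names are hypothetical. The first form (plain assignment) is an assumption inferred from the mem-to-mem GIMPLE statement quoted in the comment, since the original testcase is not shown here. The element type std::complex<double> is inferred from the 16-byte offset and the doubleD.39 type in the dumps.

```cpp
#include <complex>
#include <cstring>

// Hypothetical helper names; each one copies res[0] into res[1].

// 1. Whole-object assignment. Assumed to be the form that yields the direct
//    MEM[...] = MEM[...] statement quoted in the comment.
void copy_assign(std::complex<double>* res) {
    res[1] = res[0];
}

// 2. Component-wise copy through the real()/imag() setters, as written in
//    the comment. This is the form reported to be SLP-vectorized.
void copy_parts(std::complex<double>* res) {
    res[1].real(res[0].real());
    res[1].imag(res[0].imag());
}

// 3. Byte copy, as written in the comment. The two elements do not overlap,
//    so memcpy is valid here.
void copy_memcpy(std::complex<double>* res) {
    std::memcpy(res + 1, res, sizeof(*res));
}
```

All three are semantically equivalent. Compiling with something like g++ -O2 -march=skylake-avx512 -S, optionally adding -fdump-tree-optimized, should let the generated code and GIMPLE be compared per function. The output will likely differ between GCC versions.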