https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80372

Marc Glisse <glisse at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Keywords|                            |missed-optimization
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2017-04-09
          Component|c++                         |middle-end
     Ever confirmed|0                           |1

--- Comment #4 from Marc Glisse <glisse at gcc dot gnu.org> ---
(using -march=skylake-avx512, which sounds recent enough)

  MEM[(struct complexD.42555 *)res_1(D) + 16B] = MEM[(const struct complexD.42555 &)res_1(D)];
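
A minimal C++ function that plausibly produces this statement is a plain element assignment on a std::complex<double> array. This reduced source is an assumption (it is not quoted from the report), and copy_assign is a hypothetical name. Compiling it with -O3 -march=skylake-avx512 -fdump-tree-optimized should let you inspect the resulting GIMPLE.

```cpp
#include <cassert>
#include <complex>

// Hypothetical reduced test case (not taken from the report): copy
// element 0 of a complex<double> array into element 1 with a plain
// struct assignment. This is assumed to be the source of the single
// mem-to-mem statement MEM[res + 16B] = MEM[res] quoted above.
void copy_assign(std::complex<double>* res) {
    res[1] = res[0];
}
```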

gcc often has trouble optimizing direct mem-to-mem assignments. If I write the
code as:

  res[1].real(res[0].real());
  res[1].imag(res[0].imag());

we have

  _3 = REALPART_EXPR <MEM[(const struct complex *)res_1(D)]._M_value>;
  REALPART_EXPR <MEM[(struct complex *)res_1(D) + 16B]._M_value> = _3;
  _4 = IMAGPART_EXPR <MEM[(const struct complex *)res_1(D)]._M_value>;
  IMAGPART_EXPR <MEM[(struct complex *)res_1(D) + 16B]._M_value> = _4;

which we vectorize (SLP)

  vect__3.9_8 = MEM[(doubleD.39 *)res_1(D)];
  MEM[(doubleD.39 *)res_1(D) + 16B] = vect__3.9_8;

and generate

        vmovupd (%rdi), %xmm0
        vmovups %xmm0, 16(%rdi)
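
For comparison, here is a sketch of the component-wise rewrite as a complete function, next to a hand-written SSE2 equivalent of what the SLP vectorizer produced: one unaligned 16-byte load from res[0] and one 16-byte store to res[1]. Both function names are hypothetical, and the intrinsic version assumes an x86 target. It illustrates the vectorized form and is not meant as a replacement for the compiler's output.

```cpp
#include <cassert>
#include <complex>
#include <emmintrin.h>  // SSE2: _mm_loadu_pd / _mm_storeu_pd (x86 only)

// Component-wise rewrite from the comment (hypothetical name): two scalar
// loads and two scalar stores, which SLP can merge into one vector copy.
void copy_parts(std::complex<double>* res) {
    res[1].real(res[0].real());
    res[1].imag(res[0].imag());
}

// Hypothetical hand-written equivalent of the SLP result. The standard
// allows viewing an array of std::complex<double> as an array of double,
// with element i's real part at index 2*i and imaginary part at 2*i+1.
void copy_sse2(std::complex<double>* res) {
    const double* src = reinterpret_cast<const double*>(res);
    double* dst = reinterpret_cast<double*>(res + 1);
    __m128d v = _mm_loadu_pd(src);  // corresponds to vmovupd (%rdi), %xmm0
    _mm_storeu_pd(dst, v);          // corresponds to the 16-byte store to 16(%rdi)
}
```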

If I use memcpy(res+1,res,sizeof(*res)), we get:

  __int128 unsigned _3;
  _3 = MEM[(char * {ref-all})res_1(D)];
  MEM[(char * {ref-all})res_1(D) + 16B] = _3;

        vmovdqu64       (%rdi), %xmm0
        vmovups %xmm0, 16(%rdi)
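
The memcpy variant as a standalone function is sketched below (copy_memcpy is a hypothetical name). It relies on std::complex<double> being trivially copyable, which holds for libstdc++. The resulting code differs from the SLP version only in the load: vmovdqu64 (an integer-domain move) instead of vmovupd.

```cpp
#include <cassert>
#include <complex>
#include <cstring>

// Hypothetical helper: copy res[0] into res[1] byte-wise, as in
// memcpy(res+1,res,sizeof(*res)). sizeof(*res) is 16 bytes here, and in
// the dump above GCC turns this fixed-size memcpy into a single
// __int128 load and store instead of a library call.
void copy_memcpy(std::complex<double>* res) {
    std::memcpy(res + 1, res, sizeof(*res));
}
```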
