https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79291

--- Comment #2 from Richard Biener <rguenth at gcc dot gnu.org> ---
It also looks like mips does not implement any of the vectorizer cost
hooks and thus falls back to default_builtin_vectorization_cost, which gives
unaligned loads/stores double cost.  Yet mips does support misaligned
loads/stores via movmisalign (for MSA).  For daxpy:

       for (i = 0;i < n; i++) {
                dy[i] = dy[i] + da*dx[i];
        }

the above makes peeling for alignment of dy[] profitable (and I'd generally
agree, because misaligned stores in particular do carry a real penalty - though
likely not when the store queue is uncontended, as is likely here).
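The defaulting can be mimicked with a self-contained mock (the enum values
and function name are illustrative simplifications of what lives in
gcc/targhooks.c, not the real interface):

```c
#include <assert.h>

/* Toy mock of default_builtin_vectorization_cost for the cases relevant
   here: every vector operation costs 1 except unaligned loads/stores,
   which cost 2.  The real hook takes the statement kind plus vectype and
   misalignment arguments; this sketch only keeps the kind. */
enum vect_cost_for_stmt {
    scalar_stmt, vector_stmt, vector_load, vector_store,
    unaligned_load, unaligned_store
};

static int default_cost(enum vect_cost_for_stmt kind)
{
    switch (kind) {
    case unaligned_load:
    case unaligned_store:
        return 2;   /* the doubled cost that makes peeling look profitable */
    default:
        return 1;
    }
}
```

With these numbers an aligned store is half the cost of an unaligned one,
so a cost model comparing "peel + aligned stores" against "no peel +
unaligned stores" will prefer peeling for any reasonably long loop.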

x86_64 peels for alignment as well and we get

.L6:
        movups  (%rax,%r8), %xmm1
        addl    $1, %r9d
        mulps   %xmm2, %xmm1
        addps   (%r11,%r8), %xmm1
        movaps  %xmm1, (%r11,%r8)
        addq    $16, %r8
        cmpl    %ebx, %r9d
        jb      .L6

and uses similar base+index addressing.  IVOPTs does see that the indices
are the same, though.
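The peeling the x86_64 code performs can be sketched in plain C as follows
(function and variable names are made up for illustration; the asm above is
single precision, hence float):

```c
#include <stdint.h>

/* Peeling for alignment: run scalar iterations until dy[i] sits on a
   16-byte boundary, so the store in the main loop is always aligned and
   only the dx load may be misaligned - matching the movups (load) /
   movaps (store) split in the vectorized x86_64 loop above. */
static void saxpy_peeled(int n, float da, const float *dx, float *dy)
{
    int i = 0;
    uintptr_t misalign = (uintptr_t)dy % 16;
    /* Scalar iterations needed until dy + i is 16-byte aligned.  */
    int peel = misalign ? (int)((16 - misalign) / sizeof(float)) : 0;
    if (peel > n)
        peel = n;
    for (; i < peel; i++)          /* prologue: scalar, possibly unaligned */
        dy[i] = dy[i] + da * dx[i];
    for (; i < n; i++)             /* main loop: dy now 16-byte aligned */
        dy[i] = dy[i] + da * dx[i];
}
```

The prologue iteration count is exactly the prolog_loop_niters value that
shows up in the GIMPLE below.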

  # i_46 = PHI <i_36(7), 0(4)>
  prolog_loop_adjusted_niters.6_48 = (sizetype) prolog_loop_niters.5_34;
  niters.7_49 = niters.3_40 - prolog_loop_niters.5_34;
  bnd.8_69 = niters.7_49 >> 2;
  _75 = prolog_loop_adjusted_niters.6_48 * 4;
  vectp_dy.12_74 = dy_15(D) + _75;
  _80 = prolog_loop_adjusted_niters.6_48 * 4;
  vectp_dx.15_79 = dx_16(D) + _80;
  vect_cst__84 = {da_14(D), da_14(D), da_14(D), da_14(D)};
  _88 = prolog_loop_adjusted_niters.6_48 * 4;
  vectp_dy.20_87 = dy_15(D) + _88;

shows the missed CSE from the vectorizer (and a redundant IV).

During DR (data reference) analysis we can in theory keep a list of stmts
that share the "same" DR (we already have this for grouped reads) and record
the generated IVs on the "master" DR.

A region-based CSE/DCE would still be my preference in the end.
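A minimal flavor of such a CSE (here purely local, over a straight-line
region) can be sketched as follows; the statement representation is a toy,
not GCC's gimple, and the names are made up:

```c
/* Each statement is lhs = rhs1 <op> rhs2.  A linear scan marks every
   statement whose (op, rhs1, rhs2) key was already computed, recording
   the earlier lhs it can reuse - exactly what would fold the three
   prolog_loop_adjusted_niters * 4 multiplies above into one. */
struct stmt {
    char op;
    int lhs, rhs1, rhs2;
    int cse_of;   /* lhs of the earlier duplicate, or -1 if unique */
};

static void local_cse(struct stmt *s, int n)
{
    for (int i = 0; i < n; i++) {
        s[i].cse_of = -1;
        for (int j = 0; j < i; j++)
            if (s[j].cse_of == -1 && s[j].op == s[i].op
                && s[j].rhs1 == s[i].rhs1 && s[j].rhs2 == s[i].rhs2) {
                s[i].cse_of = s[j].lhs;   /* reuse the earlier result */
                break;
            }
    }
}
```

A real region-based pass would of course hash the keys and also do DCE on
the now-dead definitions; the quadratic scan is just to keep the sketch
short.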
