[Bug target/101059] New: v4sf reduction not optimal

rguenth at gcc dot gnu.org via Gcc-bugs Mon, 14 Jun 2021 02:30:09 -0700

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101059


            Bug ID: 101059
           Summary: v4sf reduction not optimal
           Product: gcc
           Version: 12.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: rguenth at gcc dot gnu.org
  Target Milestone: ---

When expanding

float f (v4sf v)
{
  _11 = .REDUC_PLUS (v_7(D)); [tail call]
  return _11;
}

thus exercising

(define_expand "reduc_plus_scal_<mode>"
 [(plus:REDUC_SSE_PLUS_MODE
   (match_operand:<ssescalarmode> 0 "register_operand")
   (match_operand:REDUC_SSE_PLUS_MODE 1 "register_operand"))]
 ""
{
  rtx tmp = gen_reg_rtx (<MODE>mode);
  ix86_expand_reduc (gen_add<mode>3, tmp, operands[1]);
  emit_insn (gen_vec_extract<mode><ssescalarmodelower> (operands[0], tmp,
                                                        const0_rtx));
  DONE;
})

we see with -Ofast

f:
.LFB0:
        .cfi_startproc
        movaps  %xmm0, %xmm1
        movhlps %xmm0, %xmm1
        addps   %xmm0, %xmm1
        movaps  %xmm1, %xmm0
        shufps  $85, %xmm1, %xmm0
        addps   %xmm1, %xmm0
        ret
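
For readers who want to trace what the sequence above computes, here is a
minimal sketch that mirrors it with SSE intrinsics. The function name
hsum_movhlps_first is made up for illustration and is not part of GCC; the
lane comments assume the input is [v0, v1, v2, v3]. Note that the additions
are reassociated as (v0+v2)+(v1+v3), which is why the reduction only shows
up with -Ofast / -ffast-math.

```c
#include <xmmintrin.h>

/* Hypothetical helper mirroring the emitted sequence; x86 with SSE assumed. */
float hsum_movhlps_first(__m128 v)
{
    /* movaps + movhlps: copy the high half of v into the low half. */
    __m128 hi   = _mm_movehl_ps(v, v);                /* [v2, v3, v2, v3] */
    /* addps: element 0 = v0+v2, element 1 = v1+v3. */
    __m128 sums = _mm_add_ps(hi, v);
    /* movaps + shufps $85: broadcast element 1 to all lanes. */
    __m128 odd  = _mm_shuffle_ps(sums, sums, 0x55);
    /* addps: element 0 now holds (v0+v2) + (v1+v3). */
    return _mm_cvtss_f32(_mm_add_ps(sums, odd));
}
```

Each of the two shuffles needs its own movaps here because movhlps and shufps
both overwrite their destination, which is the cost the report is about.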

https://stackoverflow.com/questions/6996764/fastest-way-to-do-horizontal-sse-vector-sum-or-other-reduction

suggests this is a regression from GCC 5, which could reuse a shuffle and
so avoid one movaps:

    # gcc 5.3 -O3:  looks optimal
    movaps  xmm1, xmm0     # I think one movaps is unavoidable, unless we have
                           # a 2nd register with known-safe floats in the upper 2 elements
    shufps  xmm1, xmm0, 177
    addps   xmm0, xmm1
    movhlps xmm1, xmm0     # note the reuse of shuf, avoiding a movaps
    addss   xmm0, xmm1

but it also suggests that with SSE3 we can use movshdup which is both
fast and has a write-only destination:

    movshdup    xmm1, xmm0
    addps       xmm0, xmm1
    movhlps     xmm1, xmm0
    addss       xmm0, xmm1
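
The two shorter sequences can be sketched with intrinsics as follows. This
is only an illustration of the shuffle pattern, not what ix86_expand_reduc
does; the function names are hypothetical, and the target("sse3") attribute
is an assumption added so that the second function builds with GCC or Clang
without passing -msse3.

```c
#include <pmmintrin.h>   /* SSE3: _mm_movehdup_ps (also pulls in SSE/SSE2) */

/* SSE1 variant: shufps 177 swaps neighbours, movhlps reuses 'shuf'. */
float hsum_sse1(__m128 v)
{
    __m128 shuf = _mm_shuffle_ps(v, v, _MM_SHUFFLE(2, 3, 0, 1)); /* imm 177 */
    __m128 sums = _mm_add_ps(v, shuf);      /* [v0+v1, v1+v0, v2+v3, v3+v2] */
    shuf = _mm_movehl_ps(shuf, sums);       /* element 0 = v2+v3 */
    sums = _mm_add_ss(sums, shuf);          /* element 0 = v0+v1+v2+v3 */
    return _mm_cvtss_f32(sums);
}

/* SSE3 variant: movshdup writes a fresh register, so no movaps is needed. */
__attribute__((target("sse3")))
float hsum_sse3(__m128 v)
{
    __m128 shuf = _mm_movehdup_ps(v);       /* [v1, v1, v3, v3] */
    __m128 sums = _mm_add_ps(v, shuf);      /* element 0 = v0+v1, 2 = v2+v3 */
    shuf = _mm_movehl_ps(shuf, sums);       /* element 0 = v2+v3 */
    sums = _mm_add_ss(sums, shuf);          /* element 0 = v0+v1+v2+v3 */
    return _mm_cvtss_f32(sums);
}
```

In both variants the first operand of _mm_movehl_ps is the old shuffle
result, whose upper elements are dead at that point; that is what lets the
compiler overwrite it in place instead of copying.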

Most of ix86_expand_reduc is dead code now, since the expanders narrow down
to SSE before dispatching to it.
