https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101059
            Bug ID: 101059
           Summary: v4sf reduction not optimal
           Product: gcc
           Version: 12.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: rguenth at gcc dot gnu.org
  Target Milestone: ---

When expanding

    float f (v4sf v)
    {
      _11 = .REDUC_PLUS (v_7(D)); [tail call]
      return _11;
    }

thus exercising

    (define_expand "reduc_plus_scal_<mode>"
     [(plus:REDUC_SSE_PLUS_MODE
       (match_operand:<ssescalarmode> 0 "register_operand")
       (match_operand:REDUC_SSE_PLUS_MODE 1 "register_operand"))]
     ""
    {
      rtx tmp = gen_reg_rtx (<MODE>mode);
      ix86_expand_reduc (gen_add<mode>3, tmp, operands[1]);
      emit_insn (gen_vec_extract<mode><ssescalarmodelower> (operands[0], tmp,
                                                            const0_rtx));
      DONE;
    })

we see with -Ofast

    f:
    .LFB0:
            .cfi_startproc
            movaps  %xmm0, %xmm1
            movhlps %xmm0, %xmm1
            addps   %xmm0, %xmm1
            movaps  %xmm1, %xmm0
            shufps  $85, %xmm1, %xmm0
            addps   %xmm1, %xmm0
            ret

https://stackoverflow.com/questions/6996764/fastest-way-to-do-horizontal-sse-vector-sum-or-other-reduction
suggests this is a regression from GCC 5, which could reuse the shuffle
register as the movhlps destination and thereby avoid one movaps:

    # gcc 5.3 -O3: looks optimal
    movaps  xmm1, xmm0     # I think one movaps is unavoidable, unless we have a 2nd register with known-safe floats in the upper 2 elements
    shufps  xmm1, xmm0, 177
    addps   xmm0, xmm1
    movhlps xmm1, xmm0     # note the reuse of shuf, avoiding a movaps
    addss   xmm0, xmm1

It also suggests that with SSE3 we can use movshdup, which is both fast and
has a write-only destination, so no initial movaps is needed:

    movshdup xmm1, xmm0
    addps    xmm0, xmm1
    movhlps  xmm1, xmm0
    addss    xmm0, xmm1

Most of ix86_expand_reduc is dead now, since the expanders narrow down to SSE
before dispatching to it.
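
For reference, the two shorter sequences quoted above can be written with intrinsics roughly as follows. This is only a sketch for experimenting: the function names hsum_sse1 and hsum_sse3 are made up for illustration and are not part of GCC, and the compiler is free to select different instructions than the intrinsics suggest.

```c
#include <immintrin.h>

/* Hypothetical helper: SSE1-only horizontal sum of four floats,
   mirroring the gcc 5.3 sequence shufps / addps / movhlps / addss.  */
static float
hsum_sse1 (__m128 v)
{
  /* Swap adjacent pairs: shuf = { v1, v0, v3, v2 } (immediate 177).  */
  __m128 shuf = _mm_shuffle_ps (v, v, _MM_SHUFFLE (2, 3, 0, 1));
  /* sums = { v0+v1, v1+v0, v2+v3, v3+v2 }.  */
  __m128 sums = _mm_add_ps (v, shuf);
  /* Reuse shuf as the movhlps destination; its low element becomes v2+v3.  */
  shuf = _mm_movehl_ps (shuf, sums);
  /* Scalar add of the two low elements, then read element 0.  */
  return _mm_cvtss_f32 (_mm_add_ss (sums, shuf));
}

#ifdef __SSE3__
/* Hypothetical helper: same reduction using movshdup, whose destination
   is write-only, so no register copy is needed before the shuffle.  */
static float
hsum_sse3 (__m128 v)
{
  /* Duplicate the odd elements: shuf = { v1, v1, v3, v3 }.  */
  __m128 shuf = _mm_movehdup_ps (v);
  /* sums = { v0+v1, v1+v1, v2+v3, v3+v3 }; only elements 0 and 2 matter.  */
  __m128 sums = _mm_add_ps (v, shuf);
  /* Bring v2+v3 down to the low element of shuf.  */
  shuf = _mm_movehl_ps (shuf, sums);
  return _mm_cvtss_f32 (_mm_add_ss (sums, shuf));
}
#endif
```

The SSE3 variant is only compiled when the target enables SSE3 (for example with -msse3). Comparing the output of -O3 -S for both functions against the expansion of .REDUC_PLUS should show whether the extra movaps is still emitted.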