Re: [PATCH] i386: Penalize fold-left FP reduction scalarization by ncopies

Hongtao Liu Tue, 30 Jun 2026 01:25:35 -0700

On Tue, Jun 30, 2026 at 4:00 PM Richard Biener <[email protected]> wrote:
>
> On Tue, 30 Jun 2026, Hongtao Liu wrote:
>
> > On Tue, Jun 30, 2026 at 3:12 PM Richard Biener <[email protected]> wrote:
> > >
> > > On Tue, 30 Jun 2026, hongtao.liu wrote:
> > >
> > > > x86 has no in-order FP reduction instruction, so a FOLD_LEFT FP
> > > > reduction is scalarized into per-lane extracts plus a dependent scalar
> > > > add chain (vect_expand_fold_left in tree-vect-loop.cc).
> > > > vect_model_reduction_cost costs this as ncopies vec_deconstruct ops and
> > > > ncopies * nunits scalar_stmt ops.  That captures the per-lane work but
> > > > not its serialization: the extracts and adds all feed the same scalar
> > > > accumulator, whereas the cost model treats the lanes as independent.
> > >
> > > It really also boils down to how we interpret 'cost'.  As we sum
> > > latency numbers you could view it as a pessimistic overall iteration
> > > latency.  We don't have any idea of throughput in the x86 cost tables
> > > (maybe via reassoc-width in a very ad-hoc way), nor do we attempt to
> > > track the actual dependence chains.
> > >
> > > > The serialization grows with ncopies, since each extra vector copy
> > > > lengthens the same scalar chain.  Track it in a new accumulator
> > > > m_fold_left_latency_penalty: for a loop-body vec_deconstruct or
> > > > scalar_stmt of an FP FOLD_LEFT_REDUCTION with ncopies > 1, add the
> > > > missing (ncopies - 1) * count * stmt_cost.  vec_deconstruct is recorded
> > > > with count == ncopies and scalar_stmt with count == ncopies * nunits, so
> > > > ncopies is recovered as count / nunits for the latter.  finish_cost 
> > > > folds
> > > > the penalty into m_costs[vect_body] once.  With ncopies == 1 there is a
> > > > single copy and no cross-copy chain, so nothing is added.
> > > >
> > > > Without this the cost model can pick a large VF (e.g. VF=32 driven by a
> > > > uint8_t load co-located with float ops) and emit a loop body far larger
> > > > than the scalar loop with no real gain.
> > > >
> > > > The patch reduces codesize(.text) of 731.astcenc_r by 13%/9% for O2/O3,
> > > > codesize of 503.bwaves_r by 7% for O3. Performance impact for SPEC2026
> > > > and SPEC2017 is negligible with O2/O3.
> > > >
> > > > Bootstrapped and regtested on x86_64-pc-linux-gnu{m32,}.
> > > >
> > > > It regressed
> > > > gcc: gcc.dg/vect/costmodel/x86_64/costmodel-vect-epil-1.c 
> > > > scan-tree-dump vect "optimized: epilogue loop vectorized using 32 byte 
> > > > vectors"
> > > > gcc: gcc.dg/vect/costmodel/x86_64/costmodel-vect-epil-1.c 
> > > > scan-tree-dump vect "optimized: loop vectorized using 64 byte vectors"
> > > > gcc: gcc.target/i386/vect-epilogues-6.c scan-tree-dump vect "optimized: 
> > > > epilogue loop vectorized using 32 byte vectors"
> > > > gcc: gcc.target/i386/vect-epilogues-6.c scan-tree-dump vect "optimized: 
> > > > loop vectorized using 64 byte vectors"
> > > > gcc: gcc.target/i386/vect-epilogues-7.c scan-tree-dump vect "optimized: 
> > > > epilogue loop vectorized using masked 64 byte vectors"
> > > > gcc: gcc.target/i386/vect-epilogues-7.c scan-tree-dump vect "optimized: 
> > > > loop vectorized using 64 byte vectors"
> > > > unix/-m32: gcc: gcc.target/i386/vect-epilogues-6.c scan-tree-dump vect 
> > > > "optimized: epilogue loop vectorized using 32 byte vectors"
> > > > unix/-m32: gcc: gcc.target/i386/vect-epilogues-6.c scan-tree-dump vect 
> > > > "optimized: loop vectorized using 64 byte vectors"
> > > > unix/-m32: gcc: gcc.target/i386/vect-epilogues-7.c scan-tree-dump vect 
> > > > "optimized: epilogue loop vectorized using masked 64 byte vectors"
> > > > unix/-m32: gcc: gcc.target/i386/vect-epilogues-7.c scan-tree-dump vect 
> > > > "optimized: loop vectorized using 64 byte vectors"
> > > >
> > > > Because they're not vectorized anymore with more penalty added in the 
> > > > patch.
> > > > If the patch makes sense, I'll just adjust those testcases as xfail?
> > >
> > > Rather than XFAILing the indiviual tests need looking at, for example
> > > gcc.dg/vect/costmodel/x86_64/costmodel-vect-epil-1.c looks like the
> > > important part was to make sure we are not using a masked epilog.
> > >
> > > > Any comments?
> > >
> > > There is now a new costing hook that could make these kind of
> > > "special cases" easier to track (I hope).  By overloading
> > > ix86_vector_costs::add_slp_cost it should be possible to catch
> > > all pieces of the operation in one go, the deconstruction
> > > and scalar ops.  This might make ::add_stmt_cost easier to
> > > follow.  I did want to experiment with that a bit, but didn't
> > > yet get to that.  Maybe you want to give that a try?
> > Sure.
> >
> > >
> > > >
> > > > gcc/ChangeLog:
> > > >
> > > >       * config/i386/i386.cc (ix86_vector_costs): Add
> > > >       m_fold_left_latency_penalty member, initialized to 0.
> > > >       (ix86_vector_costs::add_stmt_cost): Accumulate
> > > >       count * stmt_cost * (ncopies - 1) into
> > > >       m_fold_left_latency_penalty for loop-body vec_deconstruct and
> > > >       scalar_stmt costs of FP FOLD_LEFT_REDUCTION when ncopies > 1.
> > > >       (ix86_vector_costs::finish_cost): Add the accumulated penalty
> > > >       to m_costs[vect_body].
> > > >
> > > > gcc/testsuite/ChangeLog:
> > > >
> > > >       * gcc.target/i386/fold-left-reduc-cost.c: New test.
> > > > ---
> > > >  gcc/config/i386/i386.cc                       | 37 +++++++++++++++++++
> > > >  .../gcc.target/i386/fold-left-reduc-cost.c    | 17 +++++++++
> > > >  2 files changed, 54 insertions(+)
> > > >  create mode 100644 gcc/testsuite/gcc.target/i386/fold-left-reduc-cost.c
> > > >
> > > > diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
> > > > index e66958db7ac..6126f265cd3 100644
> > > > --- a/gcc/config/i386/i386.cc
> > > > +++ b/gcc/config/i386/i386.cc
> > > > @@ -26186,6 +26186,8 @@ private:
> > > >    unsigned m_num_avx512_vec_perm[3];
> > > >    /* Number of reductions for FMA/DOT_PROD_EXPR/SAD_EXPR  */
> > > >    unsigned m_num_reduc[X86_REDUC_LAST];
> > > > +  /* Scalarization latency penalty for loop-body fold-left reductions. 
> > > >  */
> > > > +  unsigned m_fold_left_latency_penalty;
> > > >    /* Don't do unroll if m_prefer_unroll is false, default is true.  */
> > > >    bool m_prefer_unroll;
> > > >  };
> > > > @@ -26197,6 +26199,7 @@ ix86_vector_costs::ix86_vector_costs (vec_info* 
> > > > vinfo, bool costing_for_scalar)
> > > >      m_num_avx256_vec_perm (),
> > > >      m_num_avx512_vec_perm (),
> > > >      m_num_reduc (),
> > > > +    m_fold_left_latency_penalty (0),
> > > >      m_prefer_unroll (true)
> > > >  {}
> > > >
> > > > @@ -26826,6 +26829,37 @@ ix86_vector_costs::add_stmt_cost (int count, 
> > > > vect_cost_for_stmt kind,
> > > >    if (stmt_cost == -1)
> > > >      stmt_cost = ix86_default_vector_cost (kind, mode);
> > > >
> > > > +  /* x86 has no in-order FP reduction instruction, so a FOLD_LEFT FP
> > > > +     reduction is scalarized into per-lane extracts plus a dependent 
> > > > scalar
> > > > +     add chain (vect_expand_fold_left).  vect_model_reduction_cost 
> > > > costs
> > > > +     this as ncopies vec_deconstruct ops and ncopies * nunits 
> > > > scalar_stmt
> > > > +     ops, which captures the per-lane work but not its serialization: 
> > > > the
> > > > +     extracts and adds all feed the same scalar accumulator, while the 
> > > > cost
> > > > +     model treats the lanes as independent.  With ncopies copies the 
> > > > chain
> > > > +     is ncopies times longer, so add the missing (ncopies - 1) * count 
> > > > *
> > > > +     stmt_cost for both kinds; finish_cost folds it in once.  
> > > > vec_deconstruct
> > > > +     is recorded with count == ncopies and scalar_stmt with count == 
> > > > ncopies
> > > > +     * nunits, hence the per-kind ncopies below.  */
> > > > +  if ((kind == vec_deconstruct || kind == scalar_stmt)
> > > > +      && where == vect_body
> > > > +      && fp
> > > > +      && !m_costing_for_scalar
> > > > +      && is_a<loop_vec_info> (m_vinfo)
> > > > +      && node
> > > > +      && vect_reduc_type (m_vinfo, node) == FOLD_LEFT_REDUCTION)
> > > > +    {
> > > > +      unsigned ncopies = (unsigned) count;
> > > > +      if (kind == scalar_stmt)
> > > > +     {
> > > > +       unsigned nunits = TYPE_VECTOR_SUBPARTS (vectype);
> > > > +       ncopies = (ncopies + nunits - 1) / nunits;
> > > > +     }
> > > > +      if (ncopies > 1)
> > > > +     m_fold_left_latency_penalty
> > > > +       += adjust_cost_for_freq (stmt_info, where,
> > > > +                                count * stmt_cost * (ncopies - 1));
> > >
> > > Why multiply by count?  How did you arrive at the scaling factor?
> > >
> > The original design only added a linear cost penalty to scalar_stmt,
> > but it still can't prevent vectorization for the testcase in the PR.
> > So I add a quadratic penalty cost of *count*(count * (ncopies - 1)) to
> > both scalar_stmt and vec_deconstruct. It's more like a benchmark
> > number.
>
> OK, so I think it's reasonable to simply prefer a low VF if there's
> any in-order reduction.  Or is the intention to disable vectorization
> completely here?  If the former then rather than these magic scaling
> factors we might want to adjust 
> better_main_loop_than_p/better_epilogue_loop_than_p
> to explicitly state this intent, outside of the rest of the costing?


I think it prefers low VF * ncopies, not VF only.
Basically for ncopies == 1, the patch doesn't add any penalty.
But when ncopies> 1, the penalty is like VF * ncopies, it tries to
prevent *vec_unpack_{lo,hi}_expr +  fold_left_reduc* which increases
code size too much.

  vect_patt_24.18_95 = [vec_unpack_lo_expr] vect_patt_23.15_89;
  vect_patt_24.18_96 = [vec_unpack_hi_expr] vect_patt_23.15_89;
  vect_patt_24.18_97 = [vec_unpack_lo_expr] vect_patt_23.15_90;
  vect_patt_24.18_98 = [vec_unpack_hi_expr] vect_patt_23.15_90;
  vect__9.19_99 = (vector(8) float) vect_patt_24.18_95;
  vect__9.19_100 = (vector(8) float) vect_patt_24.18_96;
  vect__9.19_101 = (vector(8) float) vect_patt_24.18_97;
  vect__9.19_102 = (vector(8) float) vect_patt_24.18_98;
  stmp_sum_16.20_103 = BIT_FIELD_REF <vect__9.19_99, 32, 0>;
  stmp_sum_16.20_104 = sum_20 + stmp_sum_16.20_103;
  stmp_sum_16.20_105 = BIT_FIELD_REF <vect__9.19_99, 32, 32>;
  stmp_sum_16.20_106 = stmp_sum_16.20_104 + stmp_sum_16.20_105;
  stmp_sum_16.20_107 = BIT_FIELD_REF <vect__9.19_99, 32, 64>;
  stmp_sum_16.20_108 = stmp_sum_16.20_106 + stmp_sum_16.20_107;
  stmp_sum_16.20_109 = BIT_FIELD_REF <vect__9.19_99, 32, 96>;
  stmp_sum_16.20_110 = stmp_sum_16.20_108 + stmp_sum_16.20_109;
  stmp_sum_16.20_111 = BIT_FIELD_REF <vect__9.19_99, 32, 128>;
  stmp_sum_16.20_112 = stmp_sum_16.20_110 + stmp_sum_16.20_111;
  stmp_sum_16.20_113 = BIT_FIELD_REF <vect__9.19_99, 32, 160>;
  stmp_sum_16.20_114 = stmp_sum_16.20_112 + stmp_sum_16.20_113;
  stmp_sum_16.20_115 = BIT_FIELD_REF <vect__9.19_99, 32, 192>;
  stmp_sum_16.20_116 = stmp_sum_16.20_114 + stmp_sum_16.20_115;
  stmp_sum_16.20_117 = BIT_FIELD_REF <vect__9.19_99, 32, 224>;
  stmp_sum_16.20_118 = stmp_sum_16.20_116 + stmp_sum_16.20_117;
  stmp_sum_16.20_119 = BIT_FIELD_REF <vect__9.19_100, 32, 0>;
  stmp_sum_16.20_120 = stmp_sum_16.20_118 + stmp_sum_16.20_119;
  stmp_sum_16.20_121 = BIT_FIELD_REF <vect__9.19_100, 32, 32>;
  stmp_sum_16.20_122 = stmp_sum_16.20_120 + stmp_sum_16.20_121;
  stmp_sum_16.20_123 = BIT_FIELD_REF <vect__9.19_100, 32, 64>;
  stmp_sum_16.20_124 = stmp_sum_16.20_122 + stmp_sum_16.20_123;
  stmp_sum_16.20_125 = BIT_FIELD_REF <vect__9.19_100, 32, 96>;
  stmp_sum_16.20_126 = stmp_sum_16.20_124 + stmp_sum_16.20_125;
  stmp_sum_16.20_127 = BIT_FIELD_REF <vect__9.19_100, 32, 128>;
  stmp_sum_16.20_128 = stmp_sum_16.20_126 + stmp_sum_16.20_127;
  stmp_sum_16.20_129 = BIT_FIELD_REF <vect__9.19_100, 32, 160>;
  stmp_sum_16.20_130 = stmp_sum_16.20_128 + stmp_sum_16.20_129;
  stmp_sum_16.20_131 = BIT_FIELD_REF <vect__9.19_100, 32, 192>;
  stmp_sum_16.20_132 = stmp_sum_16.20_130 + stmp_sum_16.20_131;
  stmp_sum_16.20_133 = BIT_FIELD_REF <vect__9.19_100, 32, 224>;
  stmp_sum_16.20_134 = stmp_sum_16.20_132 + stmp_sum_16.20_133;
  stmp_sum_16.20_135 = BIT_FIELD_REF <vect__9.19_101, 32, 0>;
  stmp_sum_16.20_136 = stmp_sum_16.20_134 + stmp_sum_16.20_135;
  stmp_sum_16.20_137 = BIT_FIELD_REF <vect__9.19_101, 32, 32>;
  stmp_sum_16.20_138 = stmp_sum_16.20_136 + stmp_sum_16.20_137;
  stmp_sum_16.20_139 = BIT_FIELD_REF <vect__9.19_101, 32, 64>;
  stmp_sum_16.20_140 = stmp_sum_16.20_138 + stmp_sum_16.20_139;
  stmp_sum_16.20_141 = BIT_FIELD_REF <vect__9.19_101, 32, 96>;
  stmp_sum_16.20_142 = stmp_sum_16.20_140 + stmp_sum_16.20_141;
  stmp_sum_16.20_143 = BIT_FIELD_REF <vect__9.19_101, 32, 128>;
  stmp_sum_16.20_144 = stmp_sum_16.20_142 + stmp_sum_16.20_143;
  stmp_sum_16.20_145 = BIT_FIELD_REF <vect__9.19_101, 32, 160>;
  stmp_sum_16.20_146 = stmp_sum_16.20_144 + stmp_sum_16.20_145;
  stmp_sum_16.20_147 = BIT_FIELD_REF <vect__9.19_101, 32, 192>;
  stmp_sum_16.20_148 = stmp_sum_16.20_146 + stmp_sum_16.20_147;
  stmp_sum_16.20_149 = BIT_FIELD_REF <vect__9.19_101, 32, 224>;
  stmp_sum_16.20_150 = stmp_sum_16.20_148 + stmp_sum_16.20_149;
  stmp_sum_16.20_151 = BIT_FIELD_REF <vect__9.19_102, 32, 0>;
  stmp_sum_16.20_152 = stmp_sum_16.20_150 + stmp_sum_16.20_151;
  stmp_sum_16.20_153 = BIT_FIELD_REF <vect__9.19_102, 32, 32>;
  stmp_sum_16.20_154 = stmp_sum_16.20_152 + stmp_sum_16.20_153;
  stmp_sum_16.20_155 = BIT_FIELD_REF <vect__9.19_102, 32, 64>;
  stmp_sum_16.20_156 = stmp_sum_16.20_154 + stmp_sum_16.20_155;
  stmp_sum_16.20_157 = BIT_FIELD_REF <vect__9.19_102, 32, 96>;
  stmp_sum_16.20_158 = stmp_sum_16.20_156 + stmp_sum_16.20_157;
  stmp_sum_16.20_159 = BIT_FIELD_REF <vect__9.19_102, 32, 128>;
  stmp_sum_16.20_160 = stmp_sum_16.20_158 + stmp_sum_16.20_159;
  stmp_sum_16.20_161 = BIT_FIELD_REF <vect__9.19_102, 32, 160>;
  stmp_sum_16.20_162 = stmp_sum_16.20_160 + stmp_sum_16.20_161;
  stmp_sum_16.20_163 = BIT_FIELD_REF <vect__9.19_102, 32, 192>;
  stmp_sum_16.20_164 = stmp_sum_16.20_162 + stmp_sum_16.20_163;
  stmp_sum_16.20_165 = BIT_FIELD_REF <vect__9.19_102, 32, 224>;
  sum_16 = stmp_sum_16.20_164 + stmp_sum_16.20_165;



>
> > > So you are penaltizing the deconstruction and the scalar op the same?
> > > The 'cost' for the deconstruction is the overall work, it's latency
> > > is lower when there's infinite throughput and in particular we do
> > > not have to wait for all parts to be ready but can interleave the
> > > scalar operations with the deconstructed lanes arriving, improving
> > > instruction issue and avoiding throughput bottlenecks.
> > >
> > > That said, I know we're doing the above, but we're mixing
> > > latency and number-of-overall-ops (which is what 'cost' is accumulating)
> > Yes, I think we already mix latency and throughput for the vectorized stmt 
> > cost.
> > >
> > > > +    }
> > > > +
> > > >    /* BIT_FIELD_REF <vect_**, 64, 0> with count 0 costs 0 in body.  */
> > > >    if (kind == vec_perm && vectype && count != 0)
> > > >      {
> > > > @@ -26960,6 +26994,9 @@ ix86_vector_costs::finish_cost (const 
> > > > vector_costs *scalar_costs)
> > > >
> > > >      }
> > > >
> > > > +  if (m_fold_left_latency_penalty && m_costs[vect_body] != INT_MAX)
> > > > +    m_costs[vect_body] += m_fold_left_latency_penalty;
> > > > +
> > >
> > > So you are just adding that here - why delay that?  This way it doesn't
> > > appear in any of the dumps.  An alternative might be to just track
> > > whether there is a fold-left reduction (or how many "SSE bits" of it)
> > > and use that as a "tie breaker" in
> > > better_main_loop_than_p/better_epilogue_loop_than_p?
> > >
> > > IMO we might want to (again, did so a few years ago without success) try
> > > to build a dependence DAG with uops we create during costing and at
> > > finish_cost time compute overall latency.  When I attempted this I
> > > made this _very_ simple, assuming infinite throughput and likely too
> > > coarse grained "ops" - it was as much GIGO as the current scheme.
> > I can let AI play with it to see if we can have some findings ;)
>
> Heh, wish you luck.  I didn't yet have much of that when trying AI
> for prototyping of ideas ;)
>
> Richard.
>
> >
> > >
> > > Thanks,
> > > Richard.
> > >
> > > >    ix86_vect_estimate_reg_pressure ();
> > > >
> > > >    for (int i = 0; i != 3; i++)
> > > > diff --git a/gcc/testsuite/gcc.target/i386/fold-left-reduc-cost.c 
> > > > b/gcc/testsuite/gcc.target/i386/fold-left-reduc-cost.c
> > > > new file mode 100644
> > > > index 00000000000..5f26f968429
> > > > --- /dev/null
> > > > +++ b/gcc/testsuite/gcc.target/i386/fold-left-reduc-cost.c
> > > > @@ -0,0 +1,17 @@
> > > > +/* { dg-do compile } */
> > > > +/* { dg-options "-O2 -march=x86-64-v3 -fdump-tree-vect-details" } */
> > > > +/* { dg-final { scan-tree-dump "vectorized 0 loops in function" "vect" 
> > > > } } */
> > > > +
> > > > +/* Loop driven by a uint8_t load co-located with float ops makes the
> > > > +   vectorizer pick a large VF (e.g. VF=32) so that ncopies > 1 for the
> > > > +   FOLD_LEFT FP reduction.  The per-copy scalar-add chain serialization
> > > > +   should make vectorization unprofitable.  */
> > > > +
> > > > +float
> > > > +foo (char *a, char *b, int n)
> > > > +{
> > > > +  float sum = 0;
> > > > +  for (int i = 0; i != n; i++)
> > > > +    sum += a[i] * b[i];
> > > > +  return sum;
> > > > +}
> > > >
> > >
> > > --
> > > Richard Biener <[email protected]>
> > > SUSE Software Solutions Germany GmbH,
> > > Frankenstrasse 146, 90461 Nuernberg, Germany;
> > > GF: Jochen Jaser, Andrew McDonald, Werner Knoblich; (HRB 36809, AG 
> > > Nuernberg)
> >
> >
> >
> >
>
> --
> Richard Biener <[email protected]>
> SUSE Software Solutions Germany GmbH,
> Frankenstrasse 146, 90461 Nuernberg, Germany;
> GF: Jochen Jaser, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)



-- 
BR,
Hongtao

Re: [PATCH] i386: Penalize fold-left FP reduction scalarization by ncopies

Reply via email to