Re: [PATCH] i386: Penalize fold-left FP reduction scalarization by ncopies

Richard Biener Tue, 30 Jun 2026 00:12:34 -0700

On Tue, 30 Jun 2026, hongtao.liu wrote:

> x86 has no in-order FP reduction instruction, so a FOLD_LEFT FP
> reduction is scalarized into per-lane extracts plus a dependent scalar
> add chain (vect_expand_fold_left in tree-vect-loop.cc).
> vect_model_reduction_cost costs this as ncopies vec_deconstruct ops and
> ncopies * nunits scalar_stmt ops.  That captures the per-lane work but
> not its serialization: the extracts and adds all feed the same scalar
> accumulator, whereas the cost model treats the lanes as independent.


It really also boils down to how we interpret 'cost'.  As we sum
latency numbers you could view it as a pessimistic overall iteration
latency.  We don't have any idea of throughput in the x86 cost tables
(maybe via reassoc-width in a very ad-hoc way), nor do we attempt to
track the actual dependence chains.

> The serialization grows with ncopies, since each extra vector copy
> lengthens the same scalar chain.  Track it in a new accumulator
> m_fold_left_latency_penalty: for a loop-body vec_deconstruct or
> scalar_stmt of an FP FOLD_LEFT_REDUCTION with ncopies > 1, add the
> missing (ncopies - 1) * count * stmt_cost.  vec_deconstruct is recorded
> with count == ncopies and scalar_stmt with count == ncopies * nunits, so
> ncopies is recovered as count / nunits for the latter.  finish_cost folds
> the penalty into m_costs[vect_body] once.  With ncopies == 1 there is a
> single copy and no cross-copy chain, so nothing is added.
> 
> Without this the cost model can pick a large VF (e.g. VF=32 driven by a
> uint8_t load co-located with float ops) and emit a loop body far larger
> than the scalar loop with no real gain.
> 
> The patch reduces codesize(.text) of 731.astcenc_r by 13%/9% for O2/O3,
> codesize of 503.bwaves_r by 7% for O3. Performance impact for SPEC2026
> and SPEC2017 is negligible with O2/O3.
> 
> Bootstrapped and regtested on x86_64-pc-linux-gnu{m32,}.
> 
> It regressed
> gcc: gcc.dg/vect/costmodel/x86_64/costmodel-vect-epil-1.c scan-tree-dump vect 
> "optimized: epilogue loop vectorized using 32 byte vectors"
> gcc: gcc.dg/vect/costmodel/x86_64/costmodel-vect-epil-1.c scan-tree-dump vect 
> "optimized: loop vectorized using 64 byte vectors"
> gcc: gcc.target/i386/vect-epilogues-6.c scan-tree-dump vect "optimized: 
> epilogue loop vectorized using 32 byte vectors"
> gcc: gcc.target/i386/vect-epilogues-6.c scan-tree-dump vect "optimized: loop 
> vectorized using 64 byte vectors"
> gcc: gcc.target/i386/vect-epilogues-7.c scan-tree-dump vect "optimized: 
> epilogue loop vectorized using masked 64 byte vectors"
> gcc: gcc.target/i386/vect-epilogues-7.c scan-tree-dump vect "optimized: loop 
> vectorized using 64 byte vectors"
> unix/-m32: gcc: gcc.target/i386/vect-epilogues-6.c scan-tree-dump vect 
> "optimized: epilogue loop vectorized using 32 byte vectors"
> unix/-m32: gcc: gcc.target/i386/vect-epilogues-6.c scan-tree-dump vect 
> "optimized: loop vectorized using 64 byte vectors"
> unix/-m32: gcc: gcc.target/i386/vect-epilogues-7.c scan-tree-dump vect 
> "optimized: epilogue loop vectorized using masked 64 byte vectors"
> unix/-m32: gcc: gcc.target/i386/vect-epilogues-7.c scan-tree-dump vect 
> "optimized: loop vectorized using 64 byte vectors"
> 
> Because they're not vectorized anymore with more penalty added in the patch.
> If the patch makes sense, I'll just adjust those testcases as xfail?

Rather than XFAILing the indiviual tests need looking at, for example
gcc.dg/vect/costmodel/x86_64/costmodel-vect-epil-1.c looks like the
important part was to make sure we are not using a masked epilog.
 
> Any comments?

There is now a new costing hook that could make these kind of
"special cases" easier to track (I hope).  By overloading
ix86_vector_costs::add_slp_cost it should be possible to catch
all pieces of the operation in one go, the deconstruction
and scalar ops.  This might make ::add_stmt_cost easier to
follow.  I did want to experiment with that a bit, but didn't
yet get to that.  Maybe you want to give that a try?

> 
> gcc/ChangeLog:
> 
>       * config/i386/i386.cc (ix86_vector_costs): Add
>       m_fold_left_latency_penalty member, initialized to 0.
>       (ix86_vector_costs::add_stmt_cost): Accumulate
>       count * stmt_cost * (ncopies - 1) into
>       m_fold_left_latency_penalty for loop-body vec_deconstruct and
>       scalar_stmt costs of FP FOLD_LEFT_REDUCTION when ncopies > 1.
>       (ix86_vector_costs::finish_cost): Add the accumulated penalty
>       to m_costs[vect_body].
> 
> gcc/testsuite/ChangeLog:
> 
>       * gcc.target/i386/fold-left-reduc-cost.c: New test.
> ---
>  gcc/config/i386/i386.cc                       | 37 +++++++++++++++++++
>  .../gcc.target/i386/fold-left-reduc-cost.c    | 17 +++++++++
>  2 files changed, 54 insertions(+)
>  create mode 100644 gcc/testsuite/gcc.target/i386/fold-left-reduc-cost.c
> 
> diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
> index e66958db7ac..6126f265cd3 100644
> --- a/gcc/config/i386/i386.cc
> +++ b/gcc/config/i386/i386.cc
> @@ -26186,6 +26186,8 @@ private:
>    unsigned m_num_avx512_vec_perm[3];
>    /* Number of reductions for FMA/DOT_PROD_EXPR/SAD_EXPR  */
>    unsigned m_num_reduc[X86_REDUC_LAST];
> +  /* Scalarization latency penalty for loop-body fold-left reductions.  */
> +  unsigned m_fold_left_latency_penalty;
>    /* Don't do unroll if m_prefer_unroll is false, default is true.  */
>    bool m_prefer_unroll;
>  };
> @@ -26197,6 +26199,7 @@ ix86_vector_costs::ix86_vector_costs (vec_info* 
> vinfo, bool costing_for_scalar)
>      m_num_avx256_vec_perm (),
>      m_num_avx512_vec_perm (),
>      m_num_reduc (),
> +    m_fold_left_latency_penalty (0),
>      m_prefer_unroll (true)
>  {}
>  
> @@ -26826,6 +26829,37 @@ ix86_vector_costs::add_stmt_cost (int count, 
> vect_cost_for_stmt kind,
>    if (stmt_cost == -1)
>      stmt_cost = ix86_default_vector_cost (kind, mode);
>  
> +  /* x86 has no in-order FP reduction instruction, so a FOLD_LEFT FP
> +     reduction is scalarized into per-lane extracts plus a dependent scalar
> +     add chain (vect_expand_fold_left).  vect_model_reduction_cost costs
> +     this as ncopies vec_deconstruct ops and ncopies * nunits scalar_stmt
> +     ops, which captures the per-lane work but not its serialization: the
> +     extracts and adds all feed the same scalar accumulator, while the cost
> +     model treats the lanes as independent.  With ncopies copies the chain
> +     is ncopies times longer, so add the missing (ncopies - 1) * count *
> +     stmt_cost for both kinds; finish_cost folds it in once.  vec_deconstruct
> +     is recorded with count == ncopies and scalar_stmt with count == ncopies
> +     * nunits, hence the per-kind ncopies below.  */
> +  if ((kind == vec_deconstruct || kind == scalar_stmt)
> +      && where == vect_body
> +      && fp
> +      && !m_costing_for_scalar
> +      && is_a<loop_vec_info> (m_vinfo)
> +      && node
> +      && vect_reduc_type (m_vinfo, node) == FOLD_LEFT_REDUCTION)
> +    {
> +      unsigned ncopies = (unsigned) count;
> +      if (kind == scalar_stmt)
> +     {
> +       unsigned nunits = TYPE_VECTOR_SUBPARTS (vectype);
> +       ncopies = (ncopies + nunits - 1) / nunits;
> +     }
> +      if (ncopies > 1)
> +     m_fold_left_latency_penalty
> +       += adjust_cost_for_freq (stmt_info, where,
> +                                count * stmt_cost * (ncopies - 1));

Why multiply by count?  How did you arrive at the scaling factor?

So you are penaltizing the deconstruction and the scalar op the same?
The 'cost' for the deconstruction is the overall work, it's latency
is lower when there's infinite throughput and in particular we do
not have to wait for all parts to be ready but can interleave the
scalar operations with the deconstructed lanes arriving, improving
instruction issue and avoiding throughput bottlenecks.

That said, I know we're doing the above, but we're mixing
latency and number-of-overall-ops (which is what 'cost' is accumulating).

> +    }
> +
>    /* BIT_FIELD_REF <vect_**, 64, 0> with count 0 costs 0 in body.  */
>    if (kind == vec_perm && vectype && count != 0)
>      {
> @@ -26960,6 +26994,9 @@ ix86_vector_costs::finish_cost (const vector_costs 
> *scalar_costs)
>  
>      }
>  
> +  if (m_fold_left_latency_penalty && m_costs[vect_body] != INT_MAX)
> +    m_costs[vect_body] += m_fold_left_latency_penalty;
> +

So you are just adding that here - why delay that?  This way it doesn't
appear in any of the dumps.  An alternative might be to just track
whether there is a fold-left reduction (or how many "SSE bits" of it)
and use that as a "tie breaker" in 
better_main_loop_than_p/better_epilogue_loop_than_p?

IMO we might want to (again, did so a few years ago without success) try
to build a dependence DAG with uops we create during costing and at
finish_cost time compute overall latency.  When I attempted this I
made this _very_ simple, assuming infinite throughput and likely too
coarse grained "ops" - it was as much GIGO as the current scheme.

Thanks,
Richard.

>    ix86_vect_estimate_reg_pressure ();
>  
>    for (int i = 0; i != 3; i++)
> diff --git a/gcc/testsuite/gcc.target/i386/fold-left-reduc-cost.c 
> b/gcc/testsuite/gcc.target/i386/fold-left-reduc-cost.c
> new file mode 100644
> index 00000000000..5f26f968429
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/fold-left-reduc-cost.c
> @@ -0,0 +1,17 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -march=x86-64-v3 -fdump-tree-vect-details" } */
> +/* { dg-final { scan-tree-dump "vectorized 0 loops in function" "vect" } } */
> +
> +/* Loop driven by a uint8_t load co-located with float ops makes the
> +   vectorizer pick a large VF (e.g. VF=32) so that ncopies > 1 for the
> +   FOLD_LEFT FP reduction.  The per-copy scalar-add chain serialization
> +   should make vectorization unprofitable.  */
> +
> +float
> +foo (char *a, char *b, int n)
> +{
> +  float sum = 0;
> +  for (int i = 0; i != n; i++)
> +    sum += a[i] * b[i];
> +  return sum;
> +}
> 

-- 
Richard Biener <[email protected]>
SUSE Software Solutions Germany GmbH,
Frankenstrasse 146, 90461 Nuernberg, Germany;
GF: Jochen Jaser, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)

Re: [PATCH] i386: Penalize fold-left FP reduction scalarization by ncopies

Reply via email to