On Tue, Jun 30, 2026 at 3:12 PM Richard Biener <[email protected]> wrote:
>
> On Tue, 30 Jun 2026, hongtao.liu wrote:
>
> > x86 has no in-order FP reduction instruction, so a FOLD_LEFT FP
> > reduction is scalarized into per-lane extracts plus a dependent scalar
> > add chain (vect_expand_fold_left in tree-vect-loop.cc).
> > vect_model_reduction_cost costs this as ncopies vec_deconstruct ops and
> > ncopies * nunits scalar_stmt ops. That captures the per-lane work but
> > not its serialization: the extracts and adds all feed the same scalar
> > accumulator, whereas the cost model treats the lanes as independent.
>
> It really also boils down to how we interpret 'cost'. As we sum
> latency numbers you could view it as a pessimistic overall iteration
> latency. We don't have any idea of throughput in the x86 cost tables
> (maybe via reassoc-width in a very ad-hoc way), nor do we attempt to
> track the actual dependence chains.
>
> > The serialization grows with ncopies, since each extra vector copy
> > lengthens the same scalar chain. Track it in a new accumulator
> > m_fold_left_latency_penalty: for a loop-body vec_deconstruct or
> > scalar_stmt of an FP FOLD_LEFT_REDUCTION with ncopies > 1, add the
> > missing (ncopies - 1) * count * stmt_cost. vec_deconstruct is recorded
> > with count == ncopies and scalar_stmt with count == ncopies * nunits, so
> > ncopies is recovered as count / nunits for the latter. finish_cost folds
> > the penalty into m_costs[vect_body] once. With ncopies == 1 there is a
> > single copy and no cross-copy chain, so nothing is added.
> >
> > Without this the cost model can pick a large VF (e.g. VF=32 driven by a
> > uint8_t load co-located with float ops) and emit a loop body far larger
> > than the scalar loop with no real gain.
> >
> > The patch reduces codesize(.text) of 731.astcenc_r by 13%/9% for O2/O3,
> > codesize of 503.bwaves_r by 7% for O3. Performance impact for SPEC2026
> > and SPEC2017 is negligible with O2/O3.
> >
> > Bootstrapped and regtested on x86_64-pc-linux-gnu{m32,}.
> >
> > It regressed
> > gcc: gcc.dg/vect/costmodel/x86_64/costmodel-vect-epil-1.c scan-tree-dump
> > vect "optimized: epilogue loop vectorized using 32 byte vectors"
> > gcc: gcc.dg/vect/costmodel/x86_64/costmodel-vect-epil-1.c scan-tree-dump
> > vect "optimized: loop vectorized using 64 byte vectors"
> > gcc: gcc.target/i386/vect-epilogues-6.c scan-tree-dump vect "optimized:
> > epilogue loop vectorized using 32 byte vectors"
> > gcc: gcc.target/i386/vect-epilogues-6.c scan-tree-dump vect "optimized:
> > loop vectorized using 64 byte vectors"
> > gcc: gcc.target/i386/vect-epilogues-7.c scan-tree-dump vect "optimized:
> > epilogue loop vectorized using masked 64 byte vectors"
> > gcc: gcc.target/i386/vect-epilogues-7.c scan-tree-dump vect "optimized:
> > loop vectorized using 64 byte vectors"
> > unix/-m32: gcc: gcc.target/i386/vect-epilogues-6.c scan-tree-dump vect
> > "optimized: epilogue loop vectorized using 32 byte vectors"
> > unix/-m32: gcc: gcc.target/i386/vect-epilogues-6.c scan-tree-dump vect
> > "optimized: loop vectorized using 64 byte vectors"
> > unix/-m32: gcc: gcc.target/i386/vect-epilogues-7.c scan-tree-dump vect
> > "optimized: epilogue loop vectorized using masked 64 byte vectors"
> > unix/-m32: gcc: gcc.target/i386/vect-epilogues-7.c scan-tree-dump vect
> > "optimized: loop vectorized using 64 byte vectors"
> >
> > Because they're not vectorized anymore with more penalty added in the patch.
> > If the patch makes sense, I'll just adjust those testcases as xfail?
>
> Rather than XFAILing the indiviual tests need looking at, for example
> gcc.dg/vect/costmodel/x86_64/costmodel-vect-epil-1.c looks like the
> important part was to make sure we are not using a masked epilog.
>
> > Any comments?
>
> There is now a new costing hook that could make these kind of
> "special cases" easier to track (I hope). By overloading
> ix86_vector_costs::add_slp_cost it should be possible to catch
> all pieces of the operation in one go, the deconstruction
> and scalar ops. This might make ::add_stmt_cost easier to
> follow. I did want to experiment with that a bit, but didn't
> yet get to that. Maybe you want to give that a try?
Sure.
>
> >
> > gcc/ChangeLog:
> >
> > * config/i386/i386.cc (ix86_vector_costs): Add
> > m_fold_left_latency_penalty member, initialized to 0.
> > (ix86_vector_costs::add_stmt_cost): Accumulate
> > count * stmt_cost * (ncopies - 1) into
> > m_fold_left_latency_penalty for loop-body vec_deconstruct and
> > scalar_stmt costs of FP FOLD_LEFT_REDUCTION when ncopies > 1.
> > (ix86_vector_costs::finish_cost): Add the accumulated penalty
> > to m_costs[vect_body].
> >
> > gcc/testsuite/ChangeLog:
> >
> > * gcc.target/i386/fold-left-reduc-cost.c: New test.
> > ---
> > gcc/config/i386/i386.cc | 37 +++++++++++++++++++
> > .../gcc.target/i386/fold-left-reduc-cost.c | 17 +++++++++
> > 2 files changed, 54 insertions(+)
> > create mode 100644 gcc/testsuite/gcc.target/i386/fold-left-reduc-cost.c
> >
> > diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
> > index e66958db7ac..6126f265cd3 100644
> > --- a/gcc/config/i386/i386.cc
> > +++ b/gcc/config/i386/i386.cc
> > @@ -26186,6 +26186,8 @@ private:
> > unsigned m_num_avx512_vec_perm[3];
> > /* Number of reductions for FMA/DOT_PROD_EXPR/SAD_EXPR */
> > unsigned m_num_reduc[X86_REDUC_LAST];
> > + /* Scalarization latency penalty for loop-body fold-left reductions. */
> > + unsigned m_fold_left_latency_penalty;
> > /* Don't do unroll if m_prefer_unroll is false, default is true. */
> > bool m_prefer_unroll;
> > };
> > @@ -26197,6 +26199,7 @@ ix86_vector_costs::ix86_vector_costs (vec_info*
> > vinfo, bool costing_for_scalar)
> > m_num_avx256_vec_perm (),
> > m_num_avx512_vec_perm (),
> > m_num_reduc (),
> > + m_fold_left_latency_penalty (0),
> > m_prefer_unroll (true)
> > {}
> >
> > @@ -26826,6 +26829,37 @@ ix86_vector_costs::add_stmt_cost (int count,
> > vect_cost_for_stmt kind,
> > if (stmt_cost == -1)
> > stmt_cost = ix86_default_vector_cost (kind, mode);
> >
> > + /* x86 has no in-order FP reduction instruction, so a FOLD_LEFT FP
> > + reduction is scalarized into per-lane extracts plus a dependent scalar
> > + add chain (vect_expand_fold_left). vect_model_reduction_cost costs
> > + this as ncopies vec_deconstruct ops and ncopies * nunits scalar_stmt
> > + ops, which captures the per-lane work but not its serialization: the
> > + extracts and adds all feed the same scalar accumulator, while the cost
> > + model treats the lanes as independent. With ncopies copies the chain
> > + is ncopies times longer, so add the missing (ncopies - 1) * count *
> > + stmt_cost for both kinds; finish_cost folds it in once.
> > vec_deconstruct
> > + is recorded with count == ncopies and scalar_stmt with count ==
> > ncopies
> > + * nunits, hence the per-kind ncopies below. */
> > + if ((kind == vec_deconstruct || kind == scalar_stmt)
> > + && where == vect_body
> > + && fp
> > + && !m_costing_for_scalar
> > + && is_a<loop_vec_info> (m_vinfo)
> > + && node
> > + && vect_reduc_type (m_vinfo, node) == FOLD_LEFT_REDUCTION)
> > + {
> > + unsigned ncopies = (unsigned) count;
> > + if (kind == scalar_stmt)
> > + {
> > + unsigned nunits = TYPE_VECTOR_SUBPARTS (vectype);
> > + ncopies = (ncopies + nunits - 1) / nunits;
> > + }
> > + if (ncopies > 1)
> > + m_fold_left_latency_penalty
> > + += adjust_cost_for_freq (stmt_info, where,
> > + count * stmt_cost * (ncopies - 1));
>
> Why multiply by count? How did you arrive at the scaling factor?
>
The original design only added a linear cost penalty to scalar_stmt,
but it still can't prevent vectorization for the testcase in the PR.
So I add a quadratic penalty cost of *count*(count * (ncopies - 1)) to
both scalar_stmt and vec_deconstruct. It's more like a benchmark
number.
> So you are penaltizing the deconstruction and the scalar op the same?
> The 'cost' for the deconstruction is the overall work, it's latency
> is lower when there's infinite throughput and in particular we do
> not have to wait for all parts to be ready but can interleave the
> scalar operations with the deconstructed lanes arriving, improving
> instruction issue and avoiding throughput bottlenecks.
>
> That said, I know we're doing the above, but we're mixing
> latency and number-of-overall-ops (which is what 'cost' is accumulating)
Yes, I think we already mix latency and throughput for the vectorized stmt cost.
>
> > + }
> > +
> > /* BIT_FIELD_REF <vect_**, 64, 0> with count 0 costs 0 in body. */
> > if (kind == vec_perm && vectype && count != 0)
> > {
> > @@ -26960,6 +26994,9 @@ ix86_vector_costs::finish_cost (const vector_costs
> > *scalar_costs)
> >
> > }
> >
> > + if (m_fold_left_latency_penalty && m_costs[vect_body] != INT_MAX)
> > + m_costs[vect_body] += m_fold_left_latency_penalty;
> > +
>
> So you are just adding that here - why delay that? This way it doesn't
> appear in any of the dumps. An alternative might be to just track
> whether there is a fold-left reduction (or how many "SSE bits" of it)
> and use that as a "tie breaker" in
> better_main_loop_than_p/better_epilogue_loop_than_p?
>
> IMO we might want to (again, did so a few years ago without success) try
> to build a dependence DAG with uops we create during costing and at
> finish_cost time compute overall latency. When I attempted this I
> made this _very_ simple, assuming infinite throughput and likely too
> coarse grained "ops" - it was as much GIGO as the current scheme.
I can let AI play with it to see if we can have some findings ;)
>
> Thanks,
> Richard.
>
> > ix86_vect_estimate_reg_pressure ();
> >
> > for (int i = 0; i != 3; i++)
> > diff --git a/gcc/testsuite/gcc.target/i386/fold-left-reduc-cost.c
> > b/gcc/testsuite/gcc.target/i386/fold-left-reduc-cost.c
> > new file mode 100644
> > index 00000000000..5f26f968429
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/i386/fold-left-reduc-cost.c
> > @@ -0,0 +1,17 @@
> > +/* { dg-do compile } */
> > +/* { dg-options "-O2 -march=x86-64-v3 -fdump-tree-vect-details" } */
> > +/* { dg-final { scan-tree-dump "vectorized 0 loops in function" "vect" } }
> > */
> > +
> > +/* Loop driven by a uint8_t load co-located with float ops makes the
> > + vectorizer pick a large VF (e.g. VF=32) so that ncopies > 1 for the
> > + FOLD_LEFT FP reduction. The per-copy scalar-add chain serialization
> > + should make vectorization unprofitable. */
> > +
> > +float
> > +foo (char *a, char *b, int n)
> > +{
> > + float sum = 0;
> > + for (int i = 0; i != n; i++)
> > + sum += a[i] * b[i];
> > + return sum;
> > +}
> >
>
> --
> Richard Biener <[email protected]>
> SUSE Software Solutions Germany GmbH,
> Frankenstrasse 146, 90461 Nuernberg, Germany;
> GF: Jochen Jaser, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)
--
BR,
Hongtao