On Tue, Jun 30, 2026 at 4:32 PM Richard Biener <[email protected]> wrote: > > On Tue, 30 Jun 2026, Hongtao Liu wrote: > > > On Tue, Jun 30, 2026 at 4:00 PM Richard Biener <[email protected]> wrote: > > > > > > On Tue, 30 Jun 2026, Hongtao Liu wrote: > > > > > > > On Tue, Jun 30, 2026 at 3:12 PM Richard Biener <[email protected]> > > > > wrote: > > > > > > > > > > On Tue, 30 Jun 2026, hongtao.liu wrote: > > > > > > > > > > > x86 has no in-order FP reduction instruction, so a FOLD_LEFT FP > > > > > > reduction is scalarized into per-lane extracts plus a dependent > > > > > > scalar > > > > > > add chain (vect_expand_fold_left in tree-vect-loop.cc). > > > > > > vect_model_reduction_cost costs this as ncopies vec_deconstruct ops > > > > > > and > > > > > > ncopies * nunits scalar_stmt ops. That captures the per-lane work > > > > > > but > > > > > > not its serialization: the extracts and adds all feed the same > > > > > > scalar > > > > > > accumulator, whereas the cost model treats the lanes as independent. > > > > > > > > > > It really also boils down to how we interpret 'cost'. As we sum > > > > > latency numbers you could view it as a pessimistic overall iteration > > > > > latency. We don't have any idea of throughput in the x86 cost tables > > > > > (maybe via reassoc-width in a very ad-hoc way), nor do we attempt to > > > > > track the actual dependence chains. > > > > > > > > > > > The serialization grows with ncopies, since each extra vector copy > > > > > > lengthens the same scalar chain. Track it in a new accumulator > > > > > > m_fold_left_latency_penalty: for a loop-body vec_deconstruct or > > > > > > scalar_stmt of an FP FOLD_LEFT_REDUCTION with ncopies > 1, add the > > > > > > missing (ncopies - 1) * count * stmt_cost. vec_deconstruct is > > > > > > recorded > > > > > > with count == ncopies and scalar_stmt with count == ncopies * > > > > > > nunits, so > > > > > > ncopies is recovered as count / nunits for the latter. finish_cost > > > > > > folds > > > > > > the penalty into m_costs[vect_body] once. With ncopies == 1 there > > > > > > is a > > > > > > single copy and no cross-copy chain, so nothing is added. > > > > > > > > > > > > Without this the cost model can pick a large VF (e.g. VF=32 driven > > > > > > by a > > > > > > uint8_t load co-located with float ops) and emit a loop body far > > > > > > larger > > > > > > than the scalar loop with no real gain. > > > > > > > > > > > > The patch reduces codesize(.text) of 731.astcenc_r by 13%/9% for > > > > > > O2/O3, > > > > > > codesize of 503.bwaves_r by 7% for O3. Performance impact for > > > > > > SPEC2026 > > > > > > and SPEC2017 is negligible with O2/O3. > > > > > > > > > > > > Bootstrapped and regtested on x86_64-pc-linux-gnu{m32,}. > > > > > > > > > > > > It regressed > > > > > > gcc: gcc.dg/vect/costmodel/x86_64/costmodel-vect-epil-1.c > > > > > > scan-tree-dump vect "optimized: epilogue loop vectorized using 32 > > > > > > byte vectors" > > > > > > gcc: gcc.dg/vect/costmodel/x86_64/costmodel-vect-epil-1.c > > > > > > scan-tree-dump vect "optimized: loop vectorized using 64 byte > > > > > > vectors" > > > > > > gcc: gcc.target/i386/vect-epilogues-6.c scan-tree-dump vect > > > > > > "optimized: epilogue loop vectorized using 32 byte vectors" > > > > > > gcc: gcc.target/i386/vect-epilogues-6.c scan-tree-dump vect > > > > > > "optimized: loop vectorized using 64 byte vectors" > > > > > > gcc: gcc.target/i386/vect-epilogues-7.c scan-tree-dump vect > > > > > > "optimized: epilogue loop vectorized using masked 64 byte vectors" > > > > > > gcc: gcc.target/i386/vect-epilogues-7.c scan-tree-dump vect > > > > > > "optimized: loop vectorized using 64 byte vectors" > > > > > > unix/-m32: gcc: gcc.target/i386/vect-epilogues-6.c scan-tree-dump > > > > > > vect "optimized: epilogue loop vectorized using 32 byte vectors" > > > > > > unix/-m32: gcc: gcc.target/i386/vect-epilogues-6.c scan-tree-dump > > > > > > vect "optimized: loop vectorized using 64 byte vectors" > > > > > > unix/-m32: gcc: gcc.target/i386/vect-epilogues-7.c scan-tree-dump > > > > > > vect "optimized: epilogue loop vectorized using masked 64 byte > > > > > > vectors" > > > > > > unix/-m32: gcc: gcc.target/i386/vect-epilogues-7.c scan-tree-dump > > > > > > vect "optimized: loop vectorized using 64 byte vectors" > > > > > > > > > > > > Because they're not vectorized anymore with more penalty added in > > > > > > the patch. > > > > > > If the patch makes sense, I'll just adjust those testcases as xfail? > > > > > > > > > > Rather than XFAILing the indiviual tests need looking at, for example > > > > > gcc.dg/vect/costmodel/x86_64/costmodel-vect-epil-1.c looks like the > > > > > important part was to make sure we are not using a masked epilog. > > > > > > > > > > > Any comments? > > > > > > > > > > There is now a new costing hook that could make these kind of > > > > > "special cases" easier to track (I hope). By overloading > > > > > ix86_vector_costs::add_slp_cost it should be possible to catch > > > > > all pieces of the operation in one go, the deconstruction > > > > > and scalar ops. This might make ::add_stmt_cost easier to > > > > > follow. I did want to experiment with that a bit, but didn't > > > > > yet get to that. Maybe you want to give that a try? > > > > Sure. > > > > > > > > > > > > > > > > > > > > > gcc/ChangeLog: > > > > > > > > > > > > * config/i386/i386.cc (ix86_vector_costs): Add > > > > > > m_fold_left_latency_penalty member, initialized to 0. > > > > > > (ix86_vector_costs::add_stmt_cost): Accumulate > > > > > > count * stmt_cost * (ncopies - 1) into > > > > > > m_fold_left_latency_penalty for loop-body vec_deconstruct and > > > > > > scalar_stmt costs of FP FOLD_LEFT_REDUCTION when ncopies > 1. > > > > > > (ix86_vector_costs::finish_cost): Add the accumulated penalty > > > > > > to m_costs[vect_body]. > > > > > > > > > > > > gcc/testsuite/ChangeLog: > > > > > > > > > > > > * gcc.target/i386/fold-left-reduc-cost.c: New test. > > > > > > --- > > > > > > gcc/config/i386/i386.cc | 37 > > > > > > +++++++++++++++++++ > > > > > > .../gcc.target/i386/fold-left-reduc-cost.c | 17 +++++++++ > > > > > > 2 files changed, 54 insertions(+) > > > > > > create mode 100644 > > > > > > gcc/testsuite/gcc.target/i386/fold-left-reduc-cost.c > > > > > > > > > > > > diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc > > > > > > index e66958db7ac..6126f265cd3 100644 > > > > > > --- a/gcc/config/i386/i386.cc > > > > > > +++ b/gcc/config/i386/i386.cc > > > > > > @@ -26186,6 +26186,8 @@ private: > > > > > > unsigned m_num_avx512_vec_perm[3]; > > > > > > /* Number of reductions for FMA/DOT_PROD_EXPR/SAD_EXPR */ > > > > > > unsigned m_num_reduc[X86_REDUC_LAST]; > > > > > > + /* Scalarization latency penalty for loop-body fold-left > > > > > > reductions. */ > > > > > > + unsigned m_fold_left_latency_penalty; > > > > > > /* Don't do unroll if m_prefer_unroll is false, default is true. > > > > > > */ > > > > > > bool m_prefer_unroll; > > > > > > }; > > > > > > @@ -26197,6 +26199,7 @@ ix86_vector_costs::ix86_vector_costs > > > > > > (vec_info* vinfo, bool costing_for_scalar) > > > > > > m_num_avx256_vec_perm (), > > > > > > m_num_avx512_vec_perm (), > > > > > > m_num_reduc (), > > > > > > + m_fold_left_latency_penalty (0), > > > > > > m_prefer_unroll (true) > > > > > > {} > > > > > > > > > > > > @@ -26826,6 +26829,37 @@ ix86_vector_costs::add_stmt_cost (int > > > > > > count, vect_cost_for_stmt kind, > > > > > > if (stmt_cost == -1) > > > > > > stmt_cost = ix86_default_vector_cost (kind, mode); > > > > > > > > > > > > + /* x86 has no in-order FP reduction instruction, so a FOLD_LEFT > > > > > > FP > > > > > > + reduction is scalarized into per-lane extracts plus a > > > > > > dependent scalar > > > > > > + add chain (vect_expand_fold_left). vect_model_reduction_cost > > > > > > costs > > > > > > + this as ncopies vec_deconstruct ops and ncopies * nunits > > > > > > scalar_stmt > > > > > > + ops, which captures the per-lane work but not its > > > > > > serialization: the > > > > > > + extracts and adds all feed the same scalar accumulator, while > > > > > > the cost > > > > > > + model treats the lanes as independent. With ncopies copies > > > > > > the chain > > > > > > + is ncopies times longer, so add the missing (ncopies - 1) * > > > > > > count * > > > > > > + stmt_cost for both kinds; finish_cost folds it in once. > > > > > > vec_deconstruct > > > > > > + is recorded with count == ncopies and scalar_stmt with count > > > > > > == ncopies > > > > > > + * nunits, hence the per-kind ncopies below. */ > > > > > > + if ((kind == vec_deconstruct || kind == scalar_stmt) > > > > > > + && where == vect_body > > > > > > + && fp > > > > > > + && !m_costing_for_scalar > > > > > > + && is_a<loop_vec_info> (m_vinfo) > > > > > > + && node > > > > > > + && vect_reduc_type (m_vinfo, node) == FOLD_LEFT_REDUCTION) > > > > > > + { > > > > > > + unsigned ncopies = (unsigned) count; > > > > > > + if (kind == scalar_stmt) > > > > > > + { > > > > > > + unsigned nunits = TYPE_VECTOR_SUBPARTS (vectype); > > > > > > + ncopies = (ncopies + nunits - 1) / nunits; > > > > > > + } > > > > > > + if (ncopies > 1) > > > > > > + m_fold_left_latency_penalty > > > > > > + += adjust_cost_for_freq (stmt_info, where, > > > > > > + count * stmt_cost * (ncopies - 1)); > > > > > > > > > > Why multiply by count? How did you arrive at the scaling factor? > > > > > > > > > The original design only added a linear cost penalty to scalar_stmt, > > > > but it still can't prevent vectorization for the testcase in the PR. > > > > So I add a quadratic penalty cost of *count*(count * (ncopies - 1)) to > > > > both scalar_stmt and vec_deconstruct. It's more like a benchmark > > > > number. > > > > > > OK, so I think it's reasonable to simply prefer a low VF if there's > > > any in-order reduction. Or is the intention to disable vectorization > > > completely here? If the former then rather than these magic scaling > > > factors we might want to adjust > > > better_main_loop_than_p/better_epilogue_loop_than_p > > > to explicitly state this intent, outside of the rest of the costing? > > > > I think it prefers low VF * ncopies, not VF only. > > Basically for ncopies == 1, the patch doesn't add any penalty. > > But when ncopies> 1, the penalty is like VF * ncopies, it tries to > > prevent *vec_unpack_{lo,hi}_expr + fold_left_reduc* which increases > > code size too much. > > Hmm, but then this is separate from the in-order reduction. It > doesn't change the number of reduction steps. What does is a > larger VF. Different ncopies at the same VF (so %xmm vs %ymm vs > %zmm at same VF) doesn't change any of the arguments against > in-order reductions either. Eventually it's VF of the smallest data type(char), or VF * ncopies for the real reduction type(float). It represents the number of scalar ops in the reduction chain. > > > vect_patt_24.18_95 = [vec_unpack_lo_expr] vect_patt_23.15_89; > > vect_patt_24.18_96 = [vec_unpack_hi_expr] vect_patt_23.15_89; > > vect_patt_24.18_97 = [vec_unpack_lo_expr] vect_patt_23.15_90; > > vect_patt_24.18_98 = [vec_unpack_hi_expr] vect_patt_23.15_90; > > vect__9.19_99 = (vector(8) float) vect_patt_24.18_95; > > vect__9.19_100 = (vector(8) float) vect_patt_24.18_96; > > vect__9.19_101 = (vector(8) float) vect_patt_24.18_97; > > vect__9.19_102 = (vector(8) float) vect_patt_24.18_98; > > stmp_sum_16.20_103 = BIT_FIELD_REF <vect__9.19_99, 32, 0>; > > stmp_sum_16.20_104 = sum_20 + stmp_sum_16.20_103; > > stmp_sum_16.20_105 = BIT_FIELD_REF <vect__9.19_99, 32, 32>; > > stmp_sum_16.20_106 = stmp_sum_16.20_104 + stmp_sum_16.20_105; > > stmp_sum_16.20_107 = BIT_FIELD_REF <vect__9.19_99, 32, 64>; > > stmp_sum_16.20_108 = stmp_sum_16.20_106 + stmp_sum_16.20_107; > > stmp_sum_16.20_109 = BIT_FIELD_REF <vect__9.19_99, 32, 96>; > > stmp_sum_16.20_110 = stmp_sum_16.20_108 + stmp_sum_16.20_109; > > stmp_sum_16.20_111 = BIT_FIELD_REF <vect__9.19_99, 32, 128>; > > stmp_sum_16.20_112 = stmp_sum_16.20_110 + stmp_sum_16.20_111; > > stmp_sum_16.20_113 = BIT_FIELD_REF <vect__9.19_99, 32, 160>; > > stmp_sum_16.20_114 = stmp_sum_16.20_112 + stmp_sum_16.20_113; > > stmp_sum_16.20_115 = BIT_FIELD_REF <vect__9.19_99, 32, 192>; > > stmp_sum_16.20_116 = stmp_sum_16.20_114 + stmp_sum_16.20_115; > > stmp_sum_16.20_117 = BIT_FIELD_REF <vect__9.19_99, 32, 224>; > > stmp_sum_16.20_118 = stmp_sum_16.20_116 + stmp_sum_16.20_117; > > stmp_sum_16.20_119 = BIT_FIELD_REF <vect__9.19_100, 32, 0>; > > stmp_sum_16.20_120 = stmp_sum_16.20_118 + stmp_sum_16.20_119; > > stmp_sum_16.20_121 = BIT_FIELD_REF <vect__9.19_100, 32, 32>; > > stmp_sum_16.20_122 = stmp_sum_16.20_120 + stmp_sum_16.20_121; > > stmp_sum_16.20_123 = BIT_FIELD_REF <vect__9.19_100, 32, 64>; > > stmp_sum_16.20_124 = stmp_sum_16.20_122 + stmp_sum_16.20_123; > > stmp_sum_16.20_125 = BIT_FIELD_REF <vect__9.19_100, 32, 96>; > > stmp_sum_16.20_126 = stmp_sum_16.20_124 + stmp_sum_16.20_125; > > stmp_sum_16.20_127 = BIT_FIELD_REF <vect__9.19_100, 32, 128>; > > stmp_sum_16.20_128 = stmp_sum_16.20_126 + stmp_sum_16.20_127; > > stmp_sum_16.20_129 = BIT_FIELD_REF <vect__9.19_100, 32, 160>; > > stmp_sum_16.20_130 = stmp_sum_16.20_128 + stmp_sum_16.20_129; > > stmp_sum_16.20_131 = BIT_FIELD_REF <vect__9.19_100, 32, 192>; > > stmp_sum_16.20_132 = stmp_sum_16.20_130 + stmp_sum_16.20_131; > > stmp_sum_16.20_133 = BIT_FIELD_REF <vect__9.19_100, 32, 224>; > > stmp_sum_16.20_134 = stmp_sum_16.20_132 + stmp_sum_16.20_133; > > stmp_sum_16.20_135 = BIT_FIELD_REF <vect__9.19_101, 32, 0>; > > stmp_sum_16.20_136 = stmp_sum_16.20_134 + stmp_sum_16.20_135; > > stmp_sum_16.20_137 = BIT_FIELD_REF <vect__9.19_101, 32, 32>; > > stmp_sum_16.20_138 = stmp_sum_16.20_136 + stmp_sum_16.20_137; > > stmp_sum_16.20_139 = BIT_FIELD_REF <vect__9.19_101, 32, 64>; > > stmp_sum_16.20_140 = stmp_sum_16.20_138 + stmp_sum_16.20_139; > > stmp_sum_16.20_141 = BIT_FIELD_REF <vect__9.19_101, 32, 96>; > > stmp_sum_16.20_142 = stmp_sum_16.20_140 + stmp_sum_16.20_141; > > stmp_sum_16.20_143 = BIT_FIELD_REF <vect__9.19_101, 32, 128>; > > stmp_sum_16.20_144 = stmp_sum_16.20_142 + stmp_sum_16.20_143; > > stmp_sum_16.20_145 = BIT_FIELD_REF <vect__9.19_101, 32, 160>; > > stmp_sum_16.20_146 = stmp_sum_16.20_144 + stmp_sum_16.20_145; > > stmp_sum_16.20_147 = BIT_FIELD_REF <vect__9.19_101, 32, 192>; > > stmp_sum_16.20_148 = stmp_sum_16.20_146 + stmp_sum_16.20_147; > > stmp_sum_16.20_149 = BIT_FIELD_REF <vect__9.19_101, 32, 224>; > > stmp_sum_16.20_150 = stmp_sum_16.20_148 + stmp_sum_16.20_149; > > stmp_sum_16.20_151 = BIT_FIELD_REF <vect__9.19_102, 32, 0>; > > stmp_sum_16.20_152 = stmp_sum_16.20_150 + stmp_sum_16.20_151; > > stmp_sum_16.20_153 = BIT_FIELD_REF <vect__9.19_102, 32, 32>; > > stmp_sum_16.20_154 = stmp_sum_16.20_152 + stmp_sum_16.20_153; > > stmp_sum_16.20_155 = BIT_FIELD_REF <vect__9.19_102, 32, 64>; > > stmp_sum_16.20_156 = stmp_sum_16.20_154 + stmp_sum_16.20_155; > > stmp_sum_16.20_157 = BIT_FIELD_REF <vect__9.19_102, 32, 96>; > > stmp_sum_16.20_158 = stmp_sum_16.20_156 + stmp_sum_16.20_157; > > stmp_sum_16.20_159 = BIT_FIELD_REF <vect__9.19_102, 32, 128>; > > stmp_sum_16.20_160 = stmp_sum_16.20_158 + stmp_sum_16.20_159; > > stmp_sum_16.20_161 = BIT_FIELD_REF <vect__9.19_102, 32, 160>; > > stmp_sum_16.20_162 = stmp_sum_16.20_160 + stmp_sum_16.20_161; > > stmp_sum_16.20_163 = BIT_FIELD_REF <vect__9.19_102, 32, 192>; > > stmp_sum_16.20_164 = stmp_sum_16.20_162 + stmp_sum_16.20_163; > > stmp_sum_16.20_165 = BIT_FIELD_REF <vect__9.19_102, 32, 224>; > > sum_16 = stmp_sum_16.20_164 + stmp_sum_16.20_165; > > > > > > > > > > > > > > So you are penaltizing the deconstruction and the scalar op the same? > > > > > The 'cost' for the deconstruction is the overall work, it's latency > > > > > is lower when there's infinite throughput and in particular we do > > > > > not have to wait for all parts to be ready but can interleave the > > > > > scalar operations with the deconstructed lanes arriving, improving > > > > > instruction issue and avoiding throughput bottlenecks. > > > > > > > > > > That said, I know we're doing the above, but we're mixing > > > > > latency and number-of-overall-ops (which is what 'cost' is > > > > > accumulating) > > > > Yes, I think we already mix latency and throughput for the vectorized > > > > stmt cost. > > > > > > > > > > > + } > > > > > > + > > > > > > /* BIT_FIELD_REF <vect_**, 64, 0> with count 0 costs 0 in body. > > > > > > */ > > > > > > if (kind == vec_perm && vectype && count != 0) > > > > > > { > > > > > > @@ -26960,6 +26994,9 @@ ix86_vector_costs::finish_cost (const > > > > > > vector_costs *scalar_costs) > > > > > > > > > > > > } > > > > > > > > > > > > + if (m_fold_left_latency_penalty && m_costs[vect_body] != INT_MAX) > > > > > > + m_costs[vect_body] += m_fold_left_latency_penalty; > > > > > > + > > > > > > > > > > So you are just adding that here - why delay that? This way it > > > > > doesn't > > > > > appear in any of the dumps. An alternative might be to just track > > > > > whether there is a fold-left reduction (or how many "SSE bits" of it) > > > > > and use that as a "tie breaker" in > > > > > better_main_loop_than_p/better_epilogue_loop_than_p? > > > > > > > > > > IMO we might want to (again, did so a few years ago without success) > > > > > try > > > > > to build a dependence DAG with uops we create during costing and at > > > > > finish_cost time compute overall latency. When I attempted this I > > > > > made this _very_ simple, assuming infinite throughput and likely too > > > > > coarse grained "ops" - it was as much GIGO as the current scheme. > > > > I can let AI play with it to see if we can have some findings ;) > > > > > > Heh, wish you luck. I didn't yet have much of that when trying AI > > > for prototyping of ideas ;) > > > > > > Richard. > > > > > > > > > > > > > > > > > Thanks, > > > > > Richard. > > > > > > > > > > > ix86_vect_estimate_reg_pressure (); > > > > > > > > > > > > for (int i = 0; i != 3; i++) > > > > > > diff --git a/gcc/testsuite/gcc.target/i386/fold-left-reduc-cost.c > > > > > > b/gcc/testsuite/gcc.target/i386/fold-left-reduc-cost.c > > > > > > new file mode 100644 > > > > > > index 00000000000..5f26f968429 > > > > > > --- /dev/null > > > > > > +++ b/gcc/testsuite/gcc.target/i386/fold-left-reduc-cost.c > > > > > > @@ -0,0 +1,17 @@ > > > > > > +/* { dg-do compile } */ > > > > > > +/* { dg-options "-O2 -march=x86-64-v3 -fdump-tree-vect-details" } > > > > > > */ > > > > > > +/* { dg-final { scan-tree-dump "vectorized 0 loops in function" > > > > > > "vect" } } */ > > > > > > + > > > > > > +/* Loop driven by a uint8_t load co-located with float ops makes > > > > > > the > > > > > > + vectorizer pick a large VF (e.g. VF=32) so that ncopies > 1 for > > > > > > the > > > > > > + FOLD_LEFT FP reduction. The per-copy scalar-add chain > > > > > > serialization > > > > > > + should make vectorization unprofitable. */ > > > > > > + > > > > > > +float > > > > > > +foo (char *a, char *b, int n) > > > > > > +{ > > > > > > + float sum = 0; > > > > > > + for (int i = 0; i != n; i++) > > > > > > + sum += a[i] * b[i]; > > > > > > + return sum; > > > > > > +} > > > > > > > > > > > > > > > > -- > > > > > Richard Biener <[email protected]> > > > > > SUSE Software Solutions Germany GmbH, > > > > > Frankenstrasse 146, 90461 Nuernberg, Germany; > > > > > GF: Jochen Jaser, Andrew McDonald, Werner Knoblich; (HRB 36809, AG > > > > > Nuernberg) > > > > > > > > > > > > > > > > > > > > > > -- > > > Richard Biener <[email protected]> > > > SUSE Software Solutions Germany GmbH, > > > Frankenstrasse 146, 90461 Nuernberg, Germany; > > > GF: Jochen Jaser, Andrew McDonald, Werner Knoblich; (HRB 36809, AG > > > Nuernberg) > > > > > > > > > > -- > Richard Biener <[email protected]> > SUSE Software Solutions Germany GmbH, > Frankenstrasse 146, 90461 Nuernberg, Germany; > GF: Jochen Jaser, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)
-- BR, Hongtao
