On Tue, 13 May 2025, Tamar Christina wrote: > > -----Original Message----- > > From: Richard Biener <rguent...@suse.de> > > Sent: Tuesday, May 13, 2025 1:59 PM > > To: Tamar Christina <tamar.christ...@arm.com> > > Cc: gcc-patches@gcc.gnu.org; nd <n...@arm.com> > > Subject: Re: [PATCH 1/2]middle-end: Apply loop->unroll directly in > > vectorizer > > > > On Tue, 13 May 2025, Tamar Christina wrote: > > > > > Hi All, > > > > > > Consider the loop > > > > > > void f1 (int *restrict a, int n) > > > { > > > #pragma GCC unroll 4 requested > > > for (int i = 0; i < n; i++) > > > a[i] *= 2; > > > } > > > > > > Which today is vectorized and then unrolled 3x by the RTL unroller due to > > > the > > > use of the pragma. This is unfortunate because the pragma was intended > > > for the > > > scalar loop but we end up with an unrolled vector loop and a longer path > > > to the > > > entry which has a low enough VF requirement to enter. > > > > > > This patch instead seeds the suggested_unroll_factor with the value the > > > user > > > requested and instead uses it to maintain the total VF that the user > > > wanted the > > > scalar loop to maintain. > > > > > > In effect it applies the unrolling inside the vector loop itself. This > > > has the > > > benefits for things like reductions, as it allows us to split the > > > accumulator > > > and so the unrolled loop is more efficient. For early-break it allows the > > > cbranch call to be shared between the unrolled elements, giving you more > > > effective unrolling because it doesn't need the repeated cbranch which > > > can be > > > expensive. > > > > > > The target can then choose to create multiple epilogues to deal with the > > > "rest". > > > > > > The example above now generates: > > > > > > .L4: > > > ldr q31, [x2] > > > add v31.4s, v31.4s, v31.4s > > > str q31, [x2], 16 > > > cmp x2, x3 > > > bne .L4 > > > > > > as V4SI maintains the requested VF, but e.g. pragma unroll 8 generates: > > > > > > .L4: > > > ldp q30, q31, [x2] > > > add v30.4s, v30.4s, v30.4s > > > add v31.4s, v31.4s, v31.4s > > > stp q30, q31, [x2], 32 > > > cmp x3, x2 > > > bne .L4 > > > > > > Bootstrapped Regtested on aarch64-none-linux-gnu, > > > arm-none-linux-gnueabihf, x86_64-pc-linux-gnu > > > -m32, -m64 and no issues. > > > > > > Ok for master? > > > > > > Thanks, > > > Tamar > > > > > > gcc/ChangeLog: > > > > > > * tree-vectorizer.h (vector_costs::set_suggested_unroll_factor, > > > LOOP_VINFO_USER_UNROLL): New. > > > (class _loop_vec_info): Add user_unroll. > > > * tree-vect-loop.cc (vect_estimate_min_profitable_iters): Set > > > suggested_unroll_factor before calling backend costing. > > > (_loop_vec_info::_loop_vec_info): Initialize user_unroll. > > > (vect_transform_loop): Clear the loop->unroll value if the pragma was > > > used. > > > > > > --- > > > diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc > > > index > > fe6f3cf188e40396b299ff9e814cc402bc2d4e2d..a13e4978bc7ed651be3a65d24 > > 3e84c5aaf706f65 100644 > > > --- a/gcc/tree-vect-loop.cc > > > +++ b/gcc/tree-vect-loop.cc > > > @@ -1073,6 +1073,7 @@ _loop_vec_info::_loop_vec_info (class loop *loop_in, > > vec_info_shared *shared) > > > peeling_for_gaps (false), > > > peeling_for_niter (false), > > > early_breaks (false), > > > + user_unroll (false), > > > no_data_dependencies (false), > > > has_mask_store (false), > > > scalar_loop_scaling (profile_probability::uninitialized ()), > > > @@ -4983,6 +4984,26 @@ vect_estimate_min_profitable_iters (loop_vec_info > > loop_vinfo, > > > } > > > } > > > > > > + /* Seed the target cost model with what the user requested if the > > > unroll > > > + factor is larger than 1 vector VF. */ > > > + auto user_unroll = LOOP_VINFO_LOOP (loop_vinfo)->unroll; > > > + if (user_unroll > 1) > > > + { > > > + LOOP_VINFO_USER_UNROLL (loop_vinfo) = true; > > > + int unroll_fact = user_unroll / assumed_vf; > > > + unroll_fact = 1 << ceil_log2 (unroll_fact); > > > + if (unroll_fact > 1) > > > + { > > > + if (dump_enabled_p ()) > > > + dump_printf_loc (MSG_NOTE, vect_location, > > > + "setting unroll factor to %d based on user requested " > > > + "unroll factor %d and suggested vectorization " > > > + "factor: %d\n", > > > + unroll_fact, user_unroll, assumed_vf); > > > + loop_vinfo->vector_costs->set_suggested_unroll_factor (unroll_fact); > > > > So usually targets apply this in finish_cost () so the vectorizer > > tries again with the suggested unroll factor. So that's what we > > then do unless the target overrides the factor again? > > Yes, I intended to let the target be able to override the unrolling, because > with > In particular -mcpu we know about issue rates and throughput limitations. > > We can tell when unrolling would result in slower code, for instance if it's > putting too much bottleneck on cbranch for early break for instance. > > > > > But then ... > > > > > + } > > > + } > > > + > > > /* Complete the target-specific cost calculations. */ > > > loop_vinfo->vector_costs->finish_cost (loop_vinfo->scalar_costs); > > > vec_prologue_cost = loop_vinfo->vector_costs->prologue_cost (); > > > @@ -12364,14 +12385,20 @@ vect_transform_loop (loop_vec_info loop_vinfo, > > gimple *loop_vectorized_call) > > > GET_MODE_NAME (loop_vinfo->vector_mode)); > > > } > > > > > > - /* Loops vectorized with a variable factor won't benefit from > > > + /* Loops vectorized would have already taken into account unrolling > > > specified > > > + by the user as the suggested unroll factor, as such we need to > > > prevent the > > > + RTL unroller from unrolling twice. The only exception is static > > > known > > > + iterations where we would have expected the loop to be fully > > > unrolled. > > > + Loops vectorized with a variable factor won't benefit from > > > unrolling/peeling. */ > > > - if (!vf.is_constant ()) > > > + if (LOOP_VINFO_USER_UNROLL (loop_vinfo) > > > > ... this is the transform phase - is LOOP_VINFO_USER_UNROLL copied > > from the earlier attempt? > > Ah, I see I forgot to copy it when the loop_vinfo is copied.. Will fix. > > > > > If the #pragma unroll number doesn't exactly match the VF and followup > > unrolling cannot make it matched, what should we do? You unconditionally > > disable any followup unrolling in this case. > > > > Yes. This can also happen because the cost model could have determined that > For this micro-architecture, further unrolling is detrimental. i.e. there's > no point > In overloading your load units needlessly for instance. > > The patch takes the stance that it leaves it up to the backend, and the > backend is > In control of the vector code. The pragma was intended to apply to scalar > anyway. > > > If using the cost set suggested-unroll-factor this could have all > > be done from within the target. I had expected you'd seed > > suggested_unroll_factor by loop->unroll for the first analysis > > attempt somehow, altering the vectorization factor chosen? But > > it might require quite some refactoring to make that possibility > > not too ugly. > > > > It also seems to me that we won't prefer wider vectors with > > #pragma unroll, since IIRC with using suggested_unroll_factor > > we keep the chosen vector_mode the same? In particular with > > > > #pramga GCC unroll 4 > > > > and SImode data you'd still get VF == 8 with -mavx2, just not > > extra unrolling? It might be reasonable to expect SSE vectors > > to be chosen there to match the requested unroll factor, even > > when compiling with -mavx2? > > Yeah it only focused on using it as an unrolling factor. As I thought that's > what > you meant with seeding the unroll factor. I guess instead if we want > to consider wider vector modes then it should instead increase the VF itself. > > But I'm wondering what happens if we do so it and it can't find a mode to > satisfy > It. Increasing the VF runs the risk that the vectorizer thinks it can't > vectorize > The loop doesn't it?
No, it's more complicated give we fix the vector mode used without considering the VF at all, which is why I said it might need quite some refactoring to allow the VF to determine the mode used. > I did when I was trying this first try to increase assumed_vf, and have > It calculate it based on that. But it would ICE later on if the unroll > factor was > Too high. E.g. something silly like pragma GCC unroll 80. > > But did you want to go down the road of increasing VF instead? I did originally think that this is what we should do - but I don't think we're ready for this yet. Note applying the suggested unroll factor also simply increases the VF - but by a integer factor from what is determined by earlier analysis. And that feature is designed around leaving the mode the same - it just applies extra unrolling. In the end whatever we do it's going to be a matter of documenting the interaction between vectorization and #pragma GCC unroll. The way you handle it is reasonable, the question is whether to set loop->unroll to 1 in the end (disable any further unrolling) or to 0 (only auto-unroll based on heuristics). I'd argue 0 makes more sense - iff we chose to apply the extra unrolling during vectorization. Richard. > Thanks, > Tamar > > > > > Richard. > > > > > + || !vf.is_constant ()) > > > { > > > loop->unroll = 1; > > > if (dump_enabled_p ()) > > > dump_printf_loc (MSG_NOTE, vect_location, "Disabling unrolling due to" > > > - " variable-length vectorization factor\n"); > > > + " having already considered vector unrolling or " > > > + "variable-length vectorization factor.\n"); > > > } > > > /* Free SLP instances here because otherwise stmt reference counting > > > won't work. */ > > > diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h > > > index > > a2f33a5ecd60288fe7f28ee639ff8b6a77667796..5675309ab16dc7f7f0b98e229 > > e4798075e6cdd7e 100644 > > > --- a/gcc/tree-vectorizer.h > > > +++ b/gcc/tree-vectorizer.h > > > @@ -970,6 +970,10 @@ public: > > > /* Main loop IV cond. */ > > > gcond* loop_iv_cond; > > > > > > + /* True if we have an unroll factor requested by the user through > > > pragma GCC > > > + unroll. */ > > > + bool user_unroll; > > > + > > > /* True if there are no loop carried data dependencies in the loop. > > > If loop->safelen <= 1, then this is always true, either the loop > > > didn't have any loop carried data dependencies, or the loop is being > > > @@ -1094,6 +1098,7 @@ public: > > > #define LOOP_VINFO_CHECK_UNEQUAL_ADDRS(L) (L)->check_unequal_addrs > > > #define LOOP_VINFO_CHECK_NONZERO(L) (L)->check_nonzero > > > #define LOOP_VINFO_LOWER_BOUNDS(L) (L)->lower_bounds > > > +#define LOOP_VINFO_USER_UNROLL(L) (L)->user_unroll > > > #define LOOP_VINFO_GROUPED_STORES(L) (L)->grouped_stores > > > #define LOOP_VINFO_SLP_INSTANCES(L) (L)->slp_instances > > > #define LOOP_VINFO_SLP_UNROLLING_FACTOR(L) (L)->slp_unrolling_factor > > > @@ -1693,6 +1698,7 @@ public: > > > unsigned int outside_cost () const; > > > unsigned int total_cost () const; > > > unsigned int suggested_unroll_factor () const; > > > + void set_suggested_unroll_factor (unsigned int); > > > machine_mode suggested_epilogue_mode () const; > > > > > > protected: > > > @@ -1791,6 +1797,15 @@ vector_costs::suggested_unroll_factor () const > > > return m_suggested_unroll_factor; > > > } > > > > > > +/* Set the suggested unroll factor. */ > > > + > > > +inline void > > > +vector_costs::set_suggested_unroll_factor (unsigned unroll_fact) > > > +{ > > > + gcc_checking_assert (!m_finished); > > > + m_suggested_unroll_factor = unroll_fact; > > > +} > > > + > > > /* Return the suggested epilogue mode. */ > > > > > > inline machine_mode > > > > > > > > > > > > > -- > > Richard Biener <rguent...@suse.de> > > SUSE Software Solutions Germany GmbH, > > Frankenstrasse 146, 90461 Nuernberg, Germany; > > GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg) > -- Richard Biener <rguent...@suse.de> SUSE Software Solutions Germany GmbH, Frankenstrasse 146, 90461 Nuernberg, Germany; GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)