Re: [PATCH 1v2/3][vect] Add main vectorized loop unrolling

Andre Vieira (lists) via Gcc-patches Thu, 30 Sep 2021 01:54:35 -0700

Hi,

That just forces trying the vector modes we've tried before. Though I might
need to revisit this now I think about it. I'm afraid it might be possible for
this to generate an epilogue with a vf that is not lower than that of the main
loop, but I'd need to think about this again.

Either way I don't think this changes the vector modes used for the epilogue.
But maybe I'm just missing your point here.

Yes, I was referring to the above which suggests that when we vectorize
the main loop with V4SF but unroll then we try vectorizing the
epilogue with V4SF as well (but not unrolled).  I think that's
premature (not sure if you try V8SF if the main loop was V4SF but
unrolled 4 times).

My main motivation for this was because I had a SVE loop that vectorized
with both VNx8HI, then V8HI which beat VNx8HI on cost, then it decided
to unroll V8HI by two and skipped using VNx8HI as a predicated epilogue
which would've been the best choice.

So that is why I decided to just 'reset' the vector_mode selection. In a
scenario where you only have the traditional vector modes it might make
less sense.

Just realized I still didn't add any check to make sure the epilogue has
a lower VF than the previous loop, though I'm still not sure that could
happen. I'll go look at where to add that if you agree with this.

I can move it there, it would indeed remove the need for the change to
vect_update_vf_for_slp, the change to
vect_determine_partial_vectors_and_peeling would still be required I think. It
is meant to disable using partial vectors in an unrolled loop.

Why would we disable the use of partial vectors in an unrolled loop?

The motivation behind that is that the overhead caused by generating
predicates for each iteration will likely be too much for it to be
profitable to unroll. On top of that, when dealing with low iteration
count loops, if executing one predicated iteration would be enough we
now still need to execute all other unrolled predicated iterations,
whereas if we keep them unrolled we skip the unrolled loops.
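
To make the low-iteration-count argument concrete, here is a rough C++ model
(not GCC code; the struct and function names are made up for illustration)
that only counts how many vector loop bodies execute. It assumes the epilogue
is a non-unrolled predicated loop using the same per-vector VF as the main
loop, and it ignores the cost of generating the predicates themselves.

```cpp
#include <cstdint>

// Hypothetical model, not part of the patch: counts vector-body executions.
struct VectorWork {
  std::uint64_t full;        // unpredicated vector bodies executed
  std::uint64_t predicated;  // predicated vector bodies executed
};

// Main loop unrolled `unroll` times and fully predicated: every trip through
// the loop executes all `unroll` predicated copies, even when most of them
// have an all-false predicate.
inline VectorWork predicated_unrolled(std::uint64_t n, std::uint64_t vf,
                                      std::uint64_t unroll) {
  const std::uint64_t step = vf * unroll;
  const std::uint64_t trips = (n + step - 1) / step;  // ceil(n / step)
  return {0, trips * unroll};
}

// Unpredicated unrolled main loop followed by a non-unrolled predicated
// epilogue that mops up the remaining scalar iterations.
inline VectorWork unrolled_with_predicated_epilogue(std::uint64_t n,
                                                    std::uint64_t vf,
                                                    std::uint64_t unroll) {
  const std::uint64_t step = vf * unroll;
  const std::uint64_t main_trips = n / step;     // full unrolled trips
  const std::uint64_t rem = n % step;            // left for the epilogue
  return {main_trips * unroll, (rem + vf - 1) / vf};
}
```

For n = 3 scalar iterations with a VF of 4 and an unroll factor of 4, the
fully predicated unrolled loop executes four predicated vector bodies, whereas
the second scheme skips the main loop and executes a single predicated
epilogue body.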

Sure but I'm suggesting you keep the not unrolled body as one way of
costed vectorization but then if the target says "try unrolling"
re-do the analysis with the same mode but a larger VF.  Just like
we iterate over vector modes you'll now iterate over pairs of
vector mode + VF (unroll factor).  It's not about re-using the costing
it's about using costing that is actually relevant and also to avoid
targets inventing two distinct separate costings - a target (powerpc)
might already compute load/store density and other stuff for the main
costing so it should have an idea whether doubling or triplicating is OK.

Richard.

Sounds good! I changed the patch to determine the unrolling factor
later, after all analysis has been done and retry analysis if an
unrolling factor larger than 1 has been chosen for this loop and
vector_mode.
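
As a rough illustration of that flow, the standalone C++ sketch below models
two steps: capping the requested unroll factor so the resulting VF stays
within a maximum (mirroring the halving loop the patch adds to
vect_determine_vectorization_factor), and keeping the re-analysed result only
if it succeeded and actually changed the VF. The types and function names are
hypothetical, and plain integers stand in for poly_uint64.

```cpp
#include <cstdint>
#include <functional>

// Hypothetical stand-in for the outcome of analysing a loop with one mode.
struct Analysis {
  bool ok;            // did the analysis succeed?
  std::uint64_t vf;   // vectorization factor it settled on
};

// Scale vf by the requested unroll factor, halving the factor until the
// candidate fits under max_vf.  Returns vf unchanged if nothing fits.
inline std::uint64_t apply_unroll_factor(std::uint64_t vf, unsigned unroll,
                                         std::uint64_t max_vf) {
  while (unroll > 1) {
    const std::uint64_t candidate = vf * unroll;
    if (candidate <= max_vf)
      return candidate;
    unroll /= 2;
  }
  return vf;
}

// Retry the analysis with the chosen unroll factor; fall back to the plain
// result if the retry fails or does not change the VF.
inline Analysis try_unrolling(const Analysis &plain, unsigned unroll,
                              const std::function<Analysis(unsigned)> &reanalyze) {
  if (unroll <= 1)
    return plain;  // unrolling not requested for this loop and mode
  const Analysis unrolled = reanalyze(unroll);
  if (unrolled.ok && unrolled.vf != plain.vf)
    return unrolled;
  return plain;
}
```

Note that with integer halving a request of 3 drops straight to 1, so in this
model apply_unroll_factor (4, 3, 8) leaves the VF at 4.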


gcc/ChangeLog:

        * doc/tm.texi: Document TARGET_VECTORIZE_UNROLL_FACTOR.
        * doc/tm.texi.in: Add entries for TARGET_VECTORIZE_UNROLL_FACTOR.
        * params.opt: Add vect-unroll and vect-unroll-reductions
        parameters.
        * target.def: Define hook TARGET_VECTORIZE_UNROLL_FACTOR.
        * targhooks.c (default_unroll_factor): New.
        * targhooks.h (default_unroll_factor): Likewise.
        * tree-vect-loop.c (_loop_vec_info::_loop_vec_info): Initialize
        par_unrolling_factor.
        (vect_determine_partial_vectors_and_peeling): Account for
        unrolling.
        (vect_determine_unroll_factor): New.
        (vect_try_unrolling): New.
        (vect_reanalyze_as_main_loop): Call vect_try_unrolling when
        retrying a loop_vinfo as a main loop.
        (vect_analyze_loop): Call vect_try_unrolling when vectorizing
        main loops.
        (vect_analyze_loop): Allow for epilogue vectorization when unrolling
        and rewalk vector_mode array for the epilogues.
        (vectorizable_reduction): Disable single_defuse_cycle when unrolling.
        * tree-vectorizer.h (vect_unroll_value): Declare par_unrolling_factor
        as a member of loop_vec_info.

diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi
index be8148583d8571b0d035b1938db9d056bfd213a8..71ee33a200fcbd37ccd5380321df507ae1e8961f 100644
--- a/gcc/doc/tm.texi
+++ b/gcc/doc/tm.texi
@@ -6289,6 +6289,12 @@ allocated by TARGET_VECTORIZE_INIT_COST.  The default releases the
 accumulator.
 @end deftypefn
 
+@deftypefn {Target Hook} unsigned TARGET_VECTORIZE_UNROLL_FACTOR (class vec_info *@var{vinfo})
+This hook should return the desired vector unrolling factor for a loop with
+@var{vinfo}. The default returns one, which means no unrolling will be
+performed.
+@end deftypefn
+
 @deftypefn {Target Hook} tree TARGET_VECTORIZE_BUILTIN_GATHER (const_tree @var{mem_vectype}, const_tree @var{index_type}, int @var{scale})
 Target builtin that implements vector gather operation.  @var{mem_vectype}
 is the vector type of the load and @var{index_type} is scalar type of
diff --git a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in
index d088eee4afecdbb5575b0f4f796ac344e4449155..3b3051f565ccbf88b07ee4f9f28e53cf6048d2e0 100644
--- a/gcc/doc/tm.texi.in
+++ b/gcc/doc/tm.texi.in
@@ -4198,6 +4198,8 @@ address;  but often a machine-dependent strategy can generate better code.
 
 @hook TARGET_VECTORIZE_DESTROY_COST_DATA
 
+@hook TARGET_VECTORIZE_UNROLL_FACTOR
+
 @hook TARGET_VECTORIZE_BUILTIN_GATHER
 
 @hook TARGET_VECTORIZE_BUILTIN_SCATTER
diff --git a/gcc/params.opt b/gcc/params.opt
index 658ca0288519e5f8185da67535dc42517c24d21c..d6c625c0a78e3aa21837c96a6757a57337cce22f 100644
--- a/gcc/params.opt
+++ b/gcc/params.opt
@@ -1137,4 +1137,12 @@ Controls how loop vectorizer uses partial vectors.  0 means never, 1 means only
 Common Joined UInteger Var(param_vect_inner_loop_cost_factor) Init(50) IntegerRange(1, 10000) Param Optimization
 The maximum factor which the loop vectorizer applies to the cost of statements in an inner loop relative to the loop being vectorized.
 
+-param=vect-unroll=
+Common Joined UInteger Var(param_vect_unroll) Init(0) IntegerRange(0, 32) Param Optimization
+Controls how many times the vectorizer tries to unroll loops.  Also see vect-unroll-reductions.
+
+-param=vect-unroll-reductions=
+Common Joined UInteger Var(param_vect_unroll_reductions) Init(0) IntegerRange(0, 32) Param Optimization
+Controls how many times the vectorizer tries to unroll loops that contain associative reductions.  0 means that such loops should be unrolled vect-unroll times.
+
 ; This comment is to ensure we retain the blank line above.
diff --git a/gcc/target.def b/gcc/target.def
index bfa819609c21bd71c0cc585c01dba42534453f47..8f48a453f3ff886381119291413514fae0e666ec 100644
--- a/gcc/target.def
+++ b/gcc/target.def
@@ -2095,6 +2095,16 @@ accumulator.",
  (void *data),
  default_destroy_cost_data)
 
+/* Function to determine unroll factor for vectorization.  */
+DEFHOOK
+(unroll_factor,
+ "This hook should return the desired vector unrolling factor for a loop with\n\
+@var{vinfo}. The default returns one, which means no unrolling will be\n\
+performed.",
+ unsigned,
+ (class vec_info *vinfo),
+ default_unroll_factor)
+
 HOOK_VECTOR_END (vectorize)
 
 #undef HOOK_PREFIX
diff --git a/gcc/targhooks.h b/gcc/targhooks.h
index 92d51992e625c2497aa8496b1e2e3d916e5706fd..237d7f52b7f9ec903fbc265c98b153cc174b7457 100644
--- a/gcc/targhooks.h
+++ b/gcc/targhooks.h
@@ -125,6 +125,7 @@ extern unsigned default_add_stmt_cost (class vec_info *, void *, int,
                                       enum vect_cost_model_location);
 extern void default_finish_cost (void *, unsigned *, unsigned *, unsigned *);
 extern void default_destroy_cost_data (void *);
+extern unsigned default_unroll_factor (class vec_info *);
 
 /* OpenACC hooks.  */
 extern bool default_goacc_validate_dims (tree, int [], int, unsigned);
diff --git a/gcc/targhooks.c b/gcc/targhooks.c
index c9b5208853dbc15706a65d1eb335e28e0564325e..826cbe6abb79c6b17f531cfb8332221c362dc500 100644
--- a/gcc/targhooks.c
+++ b/gcc/targhooks.c
@@ -1535,6 +1535,14 @@ default_destroy_cost_data (void *data)
   free (data);
 }
 
+/* By default, return a vector unroll factor of one, meaning no unrolling will
+   be performed.  */
+unsigned
+default_unroll_factor (class vec_info *vinfo ATTRIBUTE_UNUSED)
+{
+  return 1;
+}
+
 /* Determine whether or not a pointer mode is valid. Assume defaults
    of ptr_mode or Pmode - can be overridden.  */
 bool
diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
index 5a5b8da2e771a1dd204f22a6447eba96bb3b352c..39b50010e081dd3f03cb37c9a55b2bba49981ec8 100644
--- a/gcc/tree-vect-loop.c
+++ b/gcc/tree-vect-loop.c
@@ -365,6 +365,24 @@ vect_determine_vectorization_factor (loop_vec_info loop_vinfo)
   if (known_le (vectorization_factor, 1U))
     return opt_result::failure_at (vect_location,
                                   "not vectorized: unsupported data-type\n");
+  /* Apply unrolling factor, this was determined by
+     vect_determine_unroll_factor the first time we ran the analysis for this
+     vector mode.  */
+  if (loop_vinfo->par_unrolling_factor > 1)
+    {
+      unsigned unrolling_factor = loop_vinfo->par_unrolling_factor;
+      while (unrolling_factor > 1)
+       {
+         poly_uint64 candidate_factor = vectorization_factor * unrolling_factor;
+         if (estimated_poly_value (candidate_factor, POLY_VALUE_MAX)
+             <= (HOST_WIDE_INT) LOOP_VINFO_MAX_VECT_FACTOR (loop_vinfo))
+           {
+             vectorization_factor = candidate_factor;
+             break;
+           }
+         unrolling_factor /= 2;
+       }
+    }
   LOOP_VINFO_VECT_FACTOR (loop_vinfo) = vectorization_factor;
   return opt_result::success ();
 }
@@ -828,6 +846,7 @@ _loop_vec_info::_loop_vec_info (class loop *loop_in, vec_info_shared *shared)
     skip_main_loop_edge (nullptr),
     skip_this_loop_edge (nullptr),
     reusable_accumulators (),
+    par_unrolling_factor (1),
     max_vectorization_factor (0),
     mask_skip_niters (NULL_TREE),
     rgroup_compare_type (NULL_TREE),
@@ -2128,10 +2147,16 @@ vect_determine_partial_vectors_and_peeling (loop_vec_info loop_vinfo,
         vectors to the epilogue, with the main loop continuing to operate
         on full vectors.
 
+        If we are unrolling we also do not want to use partial vectors. This
+        is to avoid the overhead of generating multiple masks and also to
+        avoid having to execute entire iterations of FALSE masked instructions
+        when dealing with one or less full iterations.
+
         ??? We could then end up failing to use partial vectors if we
         decide to peel iterations into a prologue, and if the main loop
         then ends up processing fewer than VF iterations.  */
-      if (param_vect_partial_vector_usage == 1
+      if ((param_vect_partial_vector_usage == 1
+          || loop_vinfo->par_unrolling_factor > 1)
          && !LOOP_VINFO_EPILOGUE_P (loop_vinfo)
          && !vect_known_niters_smaller_than_vf (loop_vinfo))
        LOOP_VINFO_EPIL_USING_PARTIAL_VECTORS_P (loop_vinfo) = true;
@@ -2879,6 +2904,116 @@ vect_joust_loop_vinfos (loop_vec_info new_loop_vinfo,
   return true;
 }
 
+/* Determine whether we should unroll this loop and ask target how much to
+   unroll by.  */
+
+static opt_loop_vec_info
+vect_determine_unroll_factor (loop_vec_info loop_vinfo)
+{
+  stmt_vec_info stmt_info;
+  unsigned i;
+  bool seen_reduction_p = false;
+  poly_uint64 vectorization_factor = LOOP_VINFO_VECT_FACTOR (loop_vinfo);
+
+  FOR_EACH_VEC_ELT (loop_vinfo->stmt_vec_infos, i, stmt_info)
+    {
+      if (STMT_VINFO_IN_PATTERN_P (stmt_info)
+         || !STMT_VINFO_RELEVANT_P (stmt_info)
+         || stmt_info->vectype == NULL_TREE)
+       continue;
+      /* Do not unroll loops with negative steps as it is unlikely that
+        vectorization will succeed due to the way we deal with negative steps
+        in loads and stores in 'get_load_store_type'.  */
+      if (stmt_info->dr_aux.dr
+         && !STMT_VINFO_GATHER_SCATTER_P (stmt_info))
+       {
+         dr_vec_info *dr_info = STMT_VINFO_DR_INFO (stmt_info);
+         tree step = vect_dr_behavior (loop_vinfo, dr_info)->step;
+         if (TREE_CODE (step) == INTEGER_CST
+             && tree_int_cst_compare (step, size_zero_node) < 0)
+           {
+             return opt_loop_vec_info::failure_at
+               (vect_location, "could not unroll due to negative step\n");
+           }
+       }
+
+      if (STMT_VINFO_DEF_TYPE (stmt_info) == vect_reduction_def)
+       {
+         auto red_info = info_for_reduction (loop_vinfo, stmt_info);
+         if (STMT_VINFO_REDUC_TYPE (red_info) == TREE_CODE_REDUCTION)
+           seen_reduction_p = true;
+         else
+           {
+             return opt_loop_vec_info::failure_at
+               (vect_location, "could not unroll loop with reduction due to "
+                               "non TREE_CODE_REDUCTION\n");
+           }
+       }
+    }
+
+  unsigned int unrolling_factor = 1;
+  if (maybe_gt (vectorization_factor, 1U))
+    unrolling_factor = vect_unroll_value (loop_vinfo, seen_reduction_p);
+
+  opt_loop_vec_info unrolled_vinfo
+    = opt_loop_vec_info::success (vect_analyze_loop_form (loop_vinfo->loop,
+                                                         loop_vinfo->shared));
+  unrolled_vinfo->vector_mode = loop_vinfo->vector_mode;
+  unrolled_vinfo->par_unrolling_factor = unrolling_factor;
+  return unrolled_vinfo;
+}
+
+
+/* Try to unroll the current loop.  First determine the unrolling factor using
+   the analysis done for the current vector mode.  Then re-analyze the loop for
+   the given unrolling factor and the current vector mode.  */
+
+static opt_loop_vec_info
+vect_try_unrolling (opt_loop_vec_info loop_vinfo, unsigned *n_stmts)
+{
+  DUMP_VECT_SCOPE ("vect_try_unrolling");
+
+  opt_loop_vec_info unrolled_vinfo = vect_determine_unroll_factor (loop_vinfo);
+  if (unrolled_vinfo)
+    {
+      if (unrolled_vinfo->par_unrolling_factor > 1)
+       {
+         if (dump_enabled_p ())
+           dump_printf_loc (MSG_NOTE, vect_location,
+                            "***** unrolling factor %d chosen for vector mode %s,"
+                            "re-trying analysis...\n",
+                            unrolled_vinfo->par_unrolling_factor,
+                            GET_MODE_NAME (unrolled_vinfo->vector_mode));
+         bool unrolling_fatal = false;
+         if (vect_analyze_loop_2 (unrolled_vinfo, unrolling_fatal, n_stmts)
+             && known_ne (loop_vinfo->vectorization_factor,
+                          unrolled_vinfo->vectorization_factor))
+           {
+
+             loop_vinfo = unrolled_vinfo;
+             if (dump_enabled_p ())
+               dump_printf_loc (MSG_NOTE, vect_location,
+                                "unrolling succeeded with factor = %d\n",
+                                loop_vinfo->par_unrolling_factor);
+
+           }
+         else
+           {
+             if (dump_enabled_p ())
+               dump_printf_loc (MSG_NOTE, vect_location,
+                                "unrolling failed with factor = %d\n",
+                                unrolled_vinfo->par_unrolling_factor);
+           }
+       }
+      else
+       if (dump_enabled_p ())
+         dump_printf_loc (MSG_NOTE, vect_location,
+                          "target determined unrolling is not profitable.\n");
+    }
+  loop_vinfo->loop->aux = NULL;
+  return loop_vinfo;
+}
+
 /* If LOOP_VINFO is already a main loop, return it unmodified.  Otherwise
    try to reanalyze it as a main loop.  Return the loop_vinfo on success
    and null on failure.  */
@@ -2904,6 +3039,8 @@ vect_reanalyze_as_main_loop (loop_vec_info loop_vinfo, unsigned int *n_stmts)
   bool fatal = false;
   bool res = vect_analyze_loop_2 (main_loop_vinfo, fatal, n_stmts);
   loop->aux = NULL;
+  main_loop_vinfo = vect_try_unrolling (main_loop_vinfo, n_stmts);
+
   if (!res)
     {
       if (dump_enabled_p ())
@@ -3038,6 +3175,10 @@ vect_analyze_loop (class loop *loop, vec_info_shared *shared)
 
       if (res)
        {
+         /* Only try unrolling main loops.  */
+         if (!LOOP_VINFO_ORIG_LOOP_INFO (loop_vinfo))
+           loop_vinfo = vect_try_unrolling (loop_vinfo, &n_stmts);
+
          LOOP_VINFO_VECTORIZABLE_P (loop_vinfo) = 1;
          vectorized_loops++;
 
@@ -3062,7 +3203,14 @@ vect_analyze_loop (class loop *loop, vec_info_shared *shared)
                  gcc_assert (vect_epilogues);
                  delete vinfos.pop ();
                }
+             /* Check if we may want to replace the current first_loop_vinfo
+                with the new loop, but only if they have different vector
+                modes.  If they have the same vector mode this means the main
+                loop is an unrolled loop and we are trying to vectorize the
+                epilogue using the same vector mode but with a lower
+                vectorization factor.  */
              if (vinfos.is_empty ()
+                 && loop_vinfo->vector_mode != first_loop_vinfo->vector_mode
                  && vect_joust_loop_vinfos (loop_vinfo, first_loop_vinfo))
                {
                  loop_vec_info main_loop_vinfo
@@ -3153,13 +3301,32 @@ vect_analyze_loop (class loop *loop, vec_info_shared *shared)
 
       /* Handle the case that the original loop can use partial
         vectorization, but want to only adopt it for the epilogue.
-        The retry should be in the same mode as original.  */
+        The retry should be in the same mode as original.
+        Also handle the case where we have unrolled the main loop and want to
+        retry all vector modes again for the epilogues, since the VF is now
+        at least twice as high as the current vector mode.  */
       if (vect_epilogues
          && loop_vinfo
-         && LOOP_VINFO_EPIL_USING_PARTIAL_VECTORS_P (loop_vinfo))
+         && (LOOP_VINFO_EPIL_USING_PARTIAL_VECTORS_P (loop_vinfo)
+             || loop_vinfo->par_unrolling_factor > 1))
        {
-         gcc_assert (LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo)
+         gcc_assert ((LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo)
+                      || loop_vinfo->par_unrolling_factor > 1)
                      && !LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo));
+         /* If we are unrolling, try all VECTOR_MODES for the epilogue.  */
+         if (loop_vinfo->par_unrolling_factor > 1)
+           {
+             next_vector_mode = vector_modes[0];
+             mode_i = 1;
+
+             if (dump_enabled_p ())
+               dump_printf_loc (MSG_NOTE, vect_location,
+                                "***** Re-trying analysis with vector mode"
+                                " %s for epilogues after unrolling.\n",
+                                GET_MODE_NAME (next_vector_mode));
+             continue;
+           }
+
          if (dump_enabled_p ())
            dump_printf_loc (MSG_NOTE, vect_location,
                             "***** Re-trying analysis with same vector mode"
@@ -7212,7 +7379,8 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
    participating.  */
   if (ncopies > 1
       && (STMT_VINFO_RELEVANT (stmt_info) <= vect_used_only_live)
-      && reduc_chain_length == 1)
+      && reduc_chain_length == 1
+      && loop_vinfo->par_unrolling_factor == 1)
     single_defuse_cycle = true;
 
   if (single_defuse_cycle || lane_reduc_code_p)
diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
index c4c5678e7f1abafc25c465319dbacf3ef50f0ae9..6a8b6315a8d3337b2f7c8e17fdfec82565bfaab7 100644
--- a/gcc/tree-vectorizer.h
+++ b/gcc/tree-vectorizer.h
@@ -621,6 +621,11 @@ public:
      about the reductions that generated them.  */
   hash_map<tree, vect_reusable_accumulator> reusable_accumulators;
 
+  /* The number of times that we've unrolled the vector loop in order
+     to promote more ILP.  This value is folded into vectorization_factor
+     (and therefore exactly divides vectorization_factor).  */
+  unsigned int par_unrolling_factor;
+
   /* Maximum runtime vectorization factor, or MAX_VECTORIZATION_FACTOR
      if there is no particular limit.  */
   unsigned HOST_WIDE_INT max_vectorization_factor;
@@ -1822,6 +1827,20 @@ vect_apply_runtime_profitability_check_p (loop_vec_info loop_vinfo)
          && th >= vect_vf_for_cost (loop_vinfo));
 }
 
+/* Return the number of times that we should unroll general
+   reduction-free loops.  */
+
+inline unsigned int
+vect_unroll_value (loop_vec_info loop_vinfo, bool seen_reduction_p)
+{
+  if (seen_reduction_p && param_vect_unroll_reductions >= 1)
+    return param_vect_unroll_reductions;
+  if (param_vect_unroll >= 1)
+    return param_vect_unroll;
+  else
+    return targetm.vectorize.unroll_factor (loop_vinfo);
+}
+
 /* Source location + hotness information. */
 extern dump_user_location_t vect_location;
