Hi Richi,

See inline responses to your comments.

On 11/10/2019 13:57, Richard Biener wrote:
On Thu, 10 Oct 2019, Andre Vieira (lists) wrote:

Hi,



+
+  /* Keep track of vector sizes we know we can vectorize the epilogue
with.  */
+  vector_sizes epilogue_vsizes;
  };

please don't enlarge struct loop, instead track this somewhere
in the vectorizer (in loop_vinfo?  I see you already have
epilogue_vinfos there - so the loop_vinfo simply lacks
convenient access to the vector_size?)  I don't see any
use that could be trivially adjusted to look at a loop_vinfo
member instead.

Done.

For the vect_update_inits_of_drs this means that we'd possibly
do less CSE.  Not sure if really an issue.

CSE of what exactly? Are you afraid we are repeating a calculation here that we have already done elsewhere?


You use LOOP_VINFO_EPILOGUE_P sometimes and sometimes
LOOP_VINFO_ORIG_LOOP_INFO, please change predicates to
LOOP_VINFO_EPILOGUE_P.

I checked, and the places where I use LOOP_VINFO_ORIG_LOOP_INFO are the ones where I then go on to use the resulting loop info. If there are cases you feel strongly about, let me know.

@@ -2466,15 +2461,62 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree
niters, tree nitersm1,
    else
      niters_prolog = build_int_cst (type, 0);
+ loop_vec_info epilogue_vinfo = NULL;
+  if (vect_epilogues)
+    {
...
+       vect_epilogues = false;
+    }
+

I don't understand what all this does - it clearly needs a comment.
Maybe the overall comment of the function should be amended with
an overview of how we handle [multiple] epilogue loop vectorization?

I added more comments both here and on top of the function. Hopefully it is a bit clearer now, but it might need some tweaking.
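
As a rough illustration of the iteration count those comments describe, here is a standalone sketch of the arithmetic niters = N - ((N - G) & ~(VF - 1)) that vect_do_peeling builds for the epilogue. This is not GCC code: the name epilogue_niters and the plain integer types are made up for the example, and it assumes VF is a constant power of two and N >= 1.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper, not part of the patch.
// N    : number of scalar iterations seen by the main loop (assumed >= 1).
// VF   : vectorization factor of the main loop (assumed a power of two).
// GAPS : true when the main loop peels for gaps, i.e. G == 1.
static std::uint64_t
epilogue_niters (std::uint64_t n, std::uint64_t vf, bool gaps)
{
  assert (n >= 1 && vf != 0 && (vf & (vf - 1)) == 0);
  std::uint64_t n_minus_g = gaps ? n - 1 : n;  // N - G
  std::uint64_t vf_mask = ~(vf - 1);           // clears the low log2(VF) bits
  // Iterations covered by the main loop are (N - G) rounded down to a
  // multiple of VF; whatever is left goes to the epilogue.
  return n - (n_minus_g & vf_mask);
}
```

For N = 23 and VF = 8 this leaves 7 iterations for the epilogue. With peeling for gaps and N = 24 the main loop only covers 16 iterations, so the epilogue gets the remaining 8.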


+
+      if (epilogue_any_upper_bound && prolog_peeling >= 0)
+       {
+         epilog->any_upper_bound = true;
+         epilog->nb_iterations_upper_bound = eiters + 1;
+       }
+

comment missing.  How can prolog_peeling be < 0?  We likely
didn't set the upper bound because we don't know it in the
case we skipped the vector loop (skip_vector)?  So make sure
to not introduce wrong-code issues here - maybe do this
optimization as followup?


So the reason for this code wasn't so much an optimization as it was for correctness. But I was mistaken: the failure I was seeing without this code was not fixed by it, merely hidden by it. The real problem was that a prolog was being created using the original loop copy rather than the scalar loop, which left MASK_LOAD and MASK_STORE in the scalar prolog and led to ICEs in expand. I have fixed that issue by making sure the SCALAR_LOOP is used for prolog peeling, and either the loop copy or the SCALAR_LOOP for epilogue peeling, depending on whether we will be vectorizing said epilogue.


@@ -1726,7 +1729,13 @@ vect_analyze_loop_costing (loop_vec_info
loop_vinfo)
        return 0;
      }

-  HOST_WIDE_INT estimated_niter = estimated_stmt_executions_int (loop);
+  HOST_WIDE_INT estimated_niter = -1;
+
+  if (LOOP_VINFO_ORIG_LOOP_INFO (loop_vinfo))
+    estimated_niter
+      = vect_vf_for_cost (LOOP_VINFO_ORIG_LOOP_INFO (loop_vinfo)) - 1;
+  if (estimated_niter == -1)
+    estimated_niter = estimated_stmt_executions_int (loop);
    if (estimated_niter == -1)
      estimated_niter = likely_max_stmt_executions_int (loop);
    if (estimated_niter != -1

it's clearer if the old code is completely in a else {} path
even though vect_vf_for_cost - 1 should never be -1.

Done for the == -1 cases, need to keep the != -1 outside of course.

-  if (LOOP_REQUIRES_VERSIONING (loop_vinfo))
+  if (LOOP_REQUIRES_VERSIONING (loop_vinfo)
+      || ((orig_loop_vinfo = LOOP_VINFO_ORIG_LOOP_INFO (loop_vinfo))
+         && LOOP_REQUIRES_VERSIONING (orig_loop_vinfo)))

not sure why we need to do this for epilogues?


This is because we want to compute the versioning threshold for epilogues such that we can use the minimum versioning threshold when versioning the main loop. The reason we need to ask the original main loop is partially because of code in 'vect_analyze_data_ref_dependences' that chooses to not do DR dependence analysis and thus never fills LOOP_VINFO_MAY_ALIAS_DDRS for the epilogue's loop_vinfo, and as a consequence LOOP_VINFO_COMP_ALIAS_DDRS is always 0.

The piece of code is preceded by this comment:
  /* For epilogues we either have no aliases or alias versioning
     was applied to original loop.  Therefore we may just get max_vf
     using VF of original loop.  */

I have added some comments to make it clearer.

+static tree
+replace_ops (tree op, hash_map<tree, tree> &mapping)
+{

I'm quite sure I've seen such beast elsewhere ;)  simplify_replace_tree
comes up first (not a 1:1 match but hints at a possible tree
sharing issue in your variant).


The reason I couldn't use simplify_replace_tree is because I didn't know what the "OLD" value is at the time I want to call it. Basically I want to check whether an SSA name is a key in MAPPING and if so replace it with the corresponding VALUE.

I have changed simplify_replace_tree such that valueize can take a context parameter. I replaced one use of replace_ops with it and the other I specialized as I found that it was always a MEM_REF and we needed to replace the address it was dereferencing.
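
To show the shape of that change outside of GCC, here is a small self-contained sketch of a replace routine whose valueize callback takes an opaque context pointer, with the context being the mapping. Everything in it is a toy stand-in and an assumption for illustration: Node plays the role of 'tree', and lookup_in_mapping and replace_leaves are hypothetical names, not the functions in the patch.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <unordered_map>
#include <vector>

// Toy expression node standing in for GCC's 'tree' (not GCC API).
struct Node
{
  std::string name;               // leaf "SSA name" or operator name
  std::vector<const Node *> ops;  // empty for leaves
};

using Mapping = std::unordered_map<const Node *, const Node *>;

// Valueize callback with a context parameter: if N is a key in the
// mapping passed as CONTEXT, return its value, otherwise return N.
static const Node *
lookup_in_mapping (const Node *n, void *context)
{
  Mapping *mapping = static_cast<Mapping *> (context);
  auto it = mapping->find (n);
  return it == mapping->end () ? n : it->second;
}

// Pass every leaf of EXPR through VALUEIZE.  Interior nodes are only
// rebuilt (into POOL) when an operand changed, so untouched subtrees
// stay shared with the input.
static const Node *
replace_leaves (const Node *expr,
                const Node *(*valueize) (const Node *, void *),
                void *context,
                std::vector<std::unique_ptr<Node>> &pool)
{
  if (expr->ops.empty ())
    return valueize (expr, context);

  std::vector<const Node *> new_ops;
  bool changed = false;
  for (const Node *op : expr->ops)
    {
      const Node *r = replace_leaves (op, valueize, context, pool);
      changed |= (r != op);
      new_ops.push_back (r);
    }
  if (!changed)
    return expr;
  pool.push_back (std::make_unique<Node> (Node{expr->name, new_ops}));
  return pool.back ().get ();
}

// Print helper used only to inspect results.
static std::string
to_string (const Node *n)
{
  if (n->ops.empty ())
    return n->name;
  std::string s = n->name + "(";
  for (std::size_t i = 0; i < n->ops.size (); ++i)
    s += (i ? ", " : "") + to_string (n->ops[i]);
  return s + ")";
}
```

The point of the context parameter is that the caller does not need to know the "OLD" value up front: the callback decides per SSA name whether a replacement exists.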


+  tree advance;
    epilogue = vect_do_peeling (loop_vinfo, niters, nitersm1,
&niters_vector,
                               &step_vector, &niters_vector_mult_vf, th,
-                             check_profitability, niters_no_overflow);
+                             check_profitability, niters_no_overflow,
+                             &advance);
+
+  if (epilogue)
+    {
+      basic_block *orig_bbs = get_loop_body (loop);
+      loop_vec_info epilogue_vinfo = loop_vec_info_for_loop (epilogue);
...

orig_stmts/drs/etc. in the epilogue loop_vinfo and ...

+      /* We are done vectorizing the main loop, so now we update the
epilogues
+        stmt_vec_info's.  At the same time we set the gimple UID of each
+        statement in the epilogue, as these are used to look them up in
the
+        epilogues loop_vec_info later.  We also keep track of what
...

split this out to a new function.  I wonder why you need to record
the DRs, are they not available via ->datarefs and lookup_dr ()?

lookup_dr may no longer work at this point. I found that for some memory accesses, by the time I got to this point, the DR_STMT of the data_reference pointed to a scalar statement that no longer existed, and a lookup_dr on that data reference ICEs. I can't make this update before we transform the loop because the data references are shared, so I decided to capture the dr_vec_info's instead. Apparently we don't ever do a lookup_dr past this point, which I must admit is surprising.

Still have to go over the main loop doing the analysis/transform.

Thanks, it looks really promising (albeit expectedly ugly due to
the data rewriting).


Yeah, though I feel like now that I have put it away into functions it makes it look cleaner. That vect_transform_loop function was getting too big!

Is this OK for trunk?

gcc/ChangeLog:
2019-10-22  Andre Vieira  <andre.simoesdiasvie...@arm.com>

    PR 88915
    * gentype.c (main): Add poly_uint64 type and vector_sizes to
    generator.
    * tree-ssa-loop-niter.h (simplify_replace_tree): Change declaration.
    * tree-ssa-loop-niter.c (simplify_replace_tree): Add context parameter
    and make the valueize function pointer also take a void pointer.
    * tree-ssa-sccvn.c (vn_valueize_wrapper): New function to wrap
    around vn_valueize, to call it without a context.
    (process_bb): Use vn_valueize_wrapper instead of vn_valueize.
    * tree-vect-loop.c (vect_get_loop_niters): Make externally visible.
    (_loop_vec_info): Initialize epilogue_vinfos.
    (~_loop_vec_info): Release epilogue_vinfos.
    (vect_analyze_loop_costing): Use knowledge of main VF to estimate
    number of iterations of epilogue.
    (vect_analyze_loop_2): Adapt to analyse main loop for all supported
    vector sizes when vect-epilogues-nomask=1.  Also keep track of lowest
    versioning threshold needed for main loop.
    (vect_analyze_loop): Likewise.
    (find_in_mapping): New helper function.
    (update_epilogue_loop_vinfo): New function.
    (vect_transform_loop): When vectorizing epilogues re-use analysis done
    on main loop and call update_epilogue_loop_vinfo to update it.
    * tree-vect-loop-manip.c (vect_update_inits_of_drs): No longer insert
    stmts on loop preheader edge.
    (vect_do_peeling): Enable skip-vectors when doing loop versioning if
    we decided to vectorize epilogues.  Update epilogues NITERS and
    construct ADVANCE to update epilogues data references where needed.
    * tree-vectorizer.h (_loop_vec_info): Add epilogue_vinfos,
    epilogue_vsizes and update_epilogue_vinfo members.
    (LOOP_VINFO_UP_STMTS, LOOP_VINFO_UP_GT_DRS, LOOP_VINFO_UP_DRS,
     LOOP_VINFO_EPILOGUE_SIZES): Define MACROs.
    (vect_do_peeling, vect_get_loop_niters, vect_update_inits_of_drs,
     determine_peel_for_niter, vect_analyze_loop): Add or update declarations.
    * tree-vectorizer.c (try_vectorize_loop_1): Make sure to use already
    created loop_vec_info's for epilogues when available.  Otherwise analyse
    epilogue separately.
diff --git a/gcc/gengtype.c b/gcc/gengtype.c
index 53317337cf8c8e8caefd6b819d28b3bba301e755..80fb6ef71465b24e034fa45d69fec56be6b2e7f8 100644
--- a/gcc/gengtype.c
+++ b/gcc/gengtype.c
@@ -5197,6 +5197,7 @@ main (int argc, char **argv)
       POS_HERE (do_scalar_typedef ("widest_int", &pos));
       POS_HERE (do_scalar_typedef ("int64_t", &pos));
       POS_HERE (do_scalar_typedef ("poly_int64", &pos));
+      POS_HERE (do_scalar_typedef ("poly_uint64", &pos));
       POS_HERE (do_scalar_typedef ("uint64_t", &pos));
       POS_HERE (do_scalar_typedef ("uint8", &pos));
       POS_HERE (do_scalar_typedef ("uintptr_t", &pos));
@@ -5206,6 +5207,7 @@ main (int argc, char **argv)
       POS_HERE (do_scalar_typedef ("machine_mode", &pos));
       POS_HERE (do_scalar_typedef ("fixed_size_mode", &pos));
       POS_HERE (do_scalar_typedef ("CONSTEXPR", &pos));
+      POS_HERE (do_scalar_typedef ("vector_sizes", &pos));
       POS_HERE (do_typedef ("PTR", 
 			    create_pointer (resolve_typedef ("void", &pos)),
 			    &pos));
diff --git a/gcc/tree-ssa-loop-niter.h b/gcc/tree-ssa-loop-niter.h
index 4454c1ac78e02228047511a9e0214c82946855b8..aec6225125ce42ab0e4dbc930fc1a93862e6e267 100644
--- a/gcc/tree-ssa-loop-niter.h
+++ b/gcc/tree-ssa-loop-niter.h
@@ -53,7 +53,9 @@ extern bool scev_probably_wraps_p (tree, tree, tree, gimple *,
 				   class loop *, bool);
 extern void free_numbers_of_iterations_estimates (class loop *);
 extern void free_numbers_of_iterations_estimates (function *);
-extern tree simplify_replace_tree (tree, tree, tree, tree (*)(tree) = NULL);
+extern tree simplify_replace_tree (tree, tree,
+				   tree, tree (*)(tree, void *) = NULL,
+				   void * = NULL);
 extern void substitute_in_loop_info (class loop *, tree, tree);
 
 #endif /* GCC_TREE_SSA_LOOP_NITER_H */
diff --git a/gcc/tree-ssa-loop-niter.c b/gcc/tree-ssa-loop-niter.c
index cd2ced369719c37afd4aac08ff360719d7702e42..db666f019808850ed3a4aeef1a454a7ae2c65ef2 100644
--- a/gcc/tree-ssa-loop-niter.c
+++ b/gcc/tree-ssa-loop-niter.c
@@ -1935,7 +1935,7 @@ number_of_iterations_cond (class loop *loop,
 
 tree
 simplify_replace_tree (tree expr, tree old, tree new_tree,
-		       tree (*valueize) (tree))
+		       tree (*valueize) (tree, void*), void *context)
 {
   unsigned i, n;
   tree ret = NULL_TREE, e, se;
@@ -1951,7 +1951,7 @@ simplify_replace_tree (tree expr, tree old, tree new_tree,
     {
       if (TREE_CODE (expr) == SSA_NAME)
 	{
-	  new_tree = valueize (expr);
+	  new_tree = valueize (expr, context);
 	  if (new_tree != expr)
 	    return new_tree;
 	}
@@ -1967,7 +1967,7 @@ simplify_replace_tree (tree expr, tree old, tree new_tree,
   for (i = 0; i < n; i++)
     {
       e = TREE_OPERAND (expr, i);
-      se = simplify_replace_tree (e, old, new_tree, valueize);
+      se = simplify_replace_tree (e, old, new_tree, valueize, context);
       if (e == se)
 	continue;
 
diff --git a/gcc/tree-ssa-sccvn.c b/gcc/tree-ssa-sccvn.c
index 57331ab44dc78c16d97065cd28e8c4cdcbf8d96e..0abe3fb8453ecf2e25ff55c5c9846663f68f7c8c 100644
--- a/gcc/tree-ssa-sccvn.c
+++ b/gcc/tree-ssa-sccvn.c
@@ -309,6 +309,10 @@ static vn_tables_t valid_info;
 /* Valueization hook.  Valueize NAME if it is an SSA name, otherwise
    just return it.  */
 tree (*vn_valueize) (tree);
+tree vn_valueize_wrapper (tree t, void* context ATTRIBUTE_UNUSED)
+{
+  return vn_valueize (t);
+}
 
 
 /* This represents the top of the VN lattice, which is the universal
@@ -6407,7 +6411,7 @@ process_bb (rpo_elim &avail, basic_block bb,
       if (bb->loop_father->nb_iterations)
 	bb->loop_father->nb_iterations
 	  = simplify_replace_tree (bb->loop_father->nb_iterations,
-				   NULL_TREE, NULL_TREE, vn_valueize);
+				   NULL_TREE, NULL_TREE, &vn_valueize_wrapper);
     }
 
   /* Value-number all defs in the basic-block.  */
diff --git a/gcc/tree-vect-loop-manip.c b/gcc/tree-vect-loop-manip.c
index a2902267c62889a63af09d121a631e6d8c6f69d5..cd13d46a6a85f1f0111e97d0877feb33e401e45d 100644
--- a/gcc/tree-vect-loop-manip.c
+++ b/gcc/tree-vect-loop-manip.c
@@ -1724,7 +1724,7 @@ vect_update_init_of_dr (struct data_reference *dr, tree niters, tree_code code)
    Apply vect_update_inits_of_dr to all accesses in LOOP_VINFO.
    CODE and NITERS are as for vect_update_inits_of_dr.  */
 
-static void
+void
 vect_update_inits_of_drs (loop_vec_info loop_vinfo, tree niters,
 			  tree_code code)
 {
@@ -1734,21 +1734,12 @@ vect_update_inits_of_drs (loop_vec_info loop_vinfo, tree niters,
 
   DUMP_VECT_SCOPE ("vect_update_inits_of_dr");
 
-  /* Adjust niters to sizetype and insert stmts on loop preheader edge.  */
+  /* Adjust niters to sizetype.  We used to insert the stmts on loop preheader
+     here, but since we might use these niters to update the epilogues niters
+     and data references we can't insert them here as this definition might not
+     always dominate its uses.  */
   if (!types_compatible_p (sizetype, TREE_TYPE (niters)))
-    {
-      gimple_seq seq;
-      edge pe = loop_preheader_edge (LOOP_VINFO_LOOP (loop_vinfo));
-      tree var = create_tmp_var (sizetype, "prolog_loop_adjusted_niters");
-
-      niters = fold_convert (sizetype, niters);
-      niters = force_gimple_operand (niters, &seq, false, var);
-      if (seq)
-	{
-	  basic_block new_bb = gsi_insert_seq_on_edge_immediate (pe, seq);
-	  gcc_assert (!new_bb);
-	}
-    }
+    niters = fold_convert (sizetype, niters);
 
   FOR_EACH_VEC_ELT (datarefs, i, dr)
     {
@@ -2391,7 +2382,22 @@ slpeel_update_phi_nodes_for_lcssa (class loop *epilog)
 
    Note this function peels prolog and epilog only if it's necessary,
    as well as guards.
-   Returns created epilogue or NULL.
+   This function returns the epilogue loop if a decision was made to vectorize
+   it, otherwise NULL.
+
+   The analysis resulting in this epilogue loop's loop_vec_info was performed
+   in the same vect_analyze_loop call as the main loop's.  At that time
+   vect_analyze_loop constructs a list of accepted loop_vec_info's for lower
+   vectorization factors than the main loop.  This list is stored in the main
+   loop's loop_vec_info in the 'epilogue_vinfos' member.  Every time we decide
+   to vectorize the epilogue loop for a lower vectorization factor, the
+   loop_vec_info sitting at the top of the epilogue_vinfos list is removed,
+   updated and linked to the epilogue loop.  This is later used to vectorize
+   the epilogue.  The reason the loop_vec_info needs updating is that it was
+   constructed based on the original main loop, and the epilogue loop is a
+   copy of this loop, so all links pointing to statements in the original loop
+   need updating.  Furthermore, these loop_vec_info's share the
+   data_reference's records, which will also need to be updated.
 
    TODO: Guard for prefer_scalar_loop should be emitted along with
    versioning conditions if loop versioning is needed.  */
@@ -2401,14 +2407,18 @@ class loop *
 vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1,
 		 tree *niters_vector, tree *step_vector,
 		 tree *niters_vector_mult_vf_var, int th,
-		 bool check_profitability, bool niters_no_overflow)
+		 bool check_profitability, bool niters_no_overflow,
+		 tree *advance)
 {
   edge e, guard_e;
-  tree type = TREE_TYPE (niters), guard_cond;
+  tree type = TREE_TYPE (niters), guard_cond, vector_guard = NULL;
   basic_block guard_bb, guard_to;
   profile_probability prob_prolog, prob_vector, prob_epilog;
   int estimated_vf;
   int prolog_peeling = 0;
+  bool vect_epilogues
+    = loop_vinfo->epilogue_vinfos.length () > 0
+    && !LOOP_VINFO_EPILOGUE_P (loop_vinfo);
   /* We currently do not support prolog peeling if the target alignment is not
      known at compile time.  'vect_gen_prolog_loop_niters' depends on the
      target alignment being constant.  */
@@ -2466,15 +2476,65 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1,
   else
     niters_prolog = build_int_cst (type, 0);
 
+  loop_vec_info epilogue_vinfo = NULL;
+  if (vect_epilogues)
+    {
+      /* Take the next epilogue_vinfo to vectorize for.  */
+      epilogue_vinfo = loop_vinfo->epilogue_vinfos[0];
+      loop_vinfo->epilogue_vinfos.ordered_remove (0);
+
+      /* Don't vectorize epilogues if this is not the most inner loop or if
+	 the epilogue may need peeling for alignment as the vectorizer doesn't
+	 know how to handle these situations properly yet.  */
+      if (loop->inner != NULL
+	  || LOOP_VINFO_PEELING_FOR_ALIGNMENT (epilogue_vinfo))
+	vect_epilogues = false;
+
+    }
+
+  tree niters_vector_mult_vf;
+  unsigned int lowest_vf = constant_lower_bound (vf);
+  /* Note LOOP_VINFO_NITERS_KNOWN_P and LOOP_VINFO_INT_NITERS work
+     on niters already adjusted for the iterations of the prologue.  */
+  if (LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
+      && known_eq (vf, lowest_vf))
+    {
+      loop_vec_info orig_loop_vinfo;
+      if (LOOP_VINFO_EPILOGUE_P (loop_vinfo))
+	orig_loop_vinfo = LOOP_VINFO_ORIG_LOOP_INFO (loop_vinfo);
+      else
+	orig_loop_vinfo = loop_vinfo;
+      vector_sizes vector_sizes = LOOP_VINFO_EPILOGUE_SIZES (orig_loop_vinfo);
+      unsigned next_size = 0;
+      unsigned HOST_WIDE_INT eiters
+	= (LOOP_VINFO_INT_NITERS (loop_vinfo)
+	   - LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo));
+
+      if (prolog_peeling > 0)
+	eiters -= prolog_peeling;
+      eiters
+	= eiters % lowest_vf + LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo);
+
+      unsigned int ratio;
+      while (next_size < vector_sizes.length ()
+	     && !(constant_multiple_p (current_vector_size,
+				       vector_sizes[next_size], &ratio)
+		  && eiters >= lowest_vf / ratio))
+	next_size += 1;
+
+      if (next_size == vector_sizes.length ())
+	vect_epilogues = false;
+    }
+
   /* Prolog loop may be skipped.  */
   bool skip_prolog = (prolog_peeling != 0);
   /* Skip to epilog if scalar loop may be preferred.  It's only needed
-     when we peel for epilog loop and when it hasn't been checked with
-     loop versioning.  */
+     when we peel for epilog loop or when we loop version.  */
   bool skip_vector = (LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
 		      ? maybe_lt (LOOP_VINFO_INT_NITERS (loop_vinfo),
 				  bound_prolog + bound_epilog)
-		      : !LOOP_REQUIRES_VERSIONING (loop_vinfo));
+		      : (!LOOP_REQUIRES_VERSIONING (loop_vinfo)
+			 || vect_epilogues));
   /* Epilog loop must be executed if the number of iterations for epilog
      loop is known at compile time, otherwise we need to add a check at
      the end of vector loop and skip to the end of epilog loop.  */
@@ -2504,6 +2564,13 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1,
 
   dump_user_location_t loop_loc = find_loop_location (loop);
   class loop *scalar_loop = LOOP_VINFO_SCALAR_LOOP (loop_vinfo);
+  if (vect_epilogues)
+    /* Make sure to set the epilogue's epilogue scalar loop, such that we can
+       use the original scalar loop as remaining epilogue if
+       necessary.  */
+    LOOP_VINFO_SCALAR_LOOP (epilogue_vinfo)
+      = LOOP_VINFO_SCALAR_LOOP (loop_vinfo);
+
   if (prolog_peeling)
     {
       e = loop_preheader_edge (loop);
@@ -2584,14 +2651,22 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1,
 			   "loop can't be duplicated to exit edge.\n");
 	  gcc_unreachable ();
 	}
-      /* Peel epilog and put it on exit edge of loop.  */
-      epilog = slpeel_tree_duplicate_loop_to_edge_cfg (loop, scalar_loop, e);
+      /* Peel epilog and put it on exit edge of loop.  If we are vectorizing
+	 said epilog then we should use a copy of the main loop as a starting
+	 point.  This loop may already have had some preliminary
+	 transformations to allow for more optimal vectorization, for example
+	 if-conversion.  If we are not vectorizing the epilog then we should
+	 use the scalar loop as the transformations mentioned above make less
+	 or no sense when not vectorizing.  */
+      epilog = vect_epilogues ? get_loop_copy (loop) : scalar_loop;
+      epilog = slpeel_tree_duplicate_loop_to_edge_cfg (loop, epilog, e);
       if (!epilog)
 	{
 	  dump_printf_loc (MSG_MISSED_OPTIMIZATION, loop_loc,
 			   "slpeel_tree_duplicate_loop_to_edge_cfg failed.\n");
 	  gcc_unreachable ();
 	}
+
       epilog->force_vectorize = false;
       slpeel_update_phi_nodes_for_loops (loop_vinfo, loop, epilog, false);
 
@@ -2608,6 +2683,7 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1,
 						check_profitability);
 	  /* Build guard against NITERSM1 since NITERS may overflow.  */
 	  guard_cond = fold_build2 (LT_EXPR, boolean_type_node, nitersm1, t);
+	  vector_guard = guard_cond;
 	  guard_bb = anchor;
 	  guard_to = split_edge (loop_preheader_edge (epilog));
 	  guard_e = slpeel_add_loop_guard (guard_bb, guard_cond,
@@ -2635,7 +2711,6 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1,
 	}
 
       basic_block bb_before_epilog = loop_preheader_edge (epilog)->src;
-      tree niters_vector_mult_vf;
       /* If loop is peeled for non-zero constant times, now niters refers to
 	 orig_niters - prolog_peeling, it won't overflow even the orig_niters
 	 overflows.  */
@@ -2699,10 +2774,163 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1,
       adjust_vec_debug_stmts ();
       scev_reset ();
     }
+
+  if (vect_epilogues)
+    {
+      epilog->aux = epilogue_vinfo;
+      LOOP_VINFO_LOOP (epilogue_vinfo) = epilog;
+
+      loop_constraint_clear (epilog, LOOP_C_INFINITE);
+
+      /* We now must calculate the number of iterations for our epilogue.  */
+      tree cond_niters, niters;
+
+      /* Depending on whether we peel for gaps we take niters or niters - 1,
+	 we will refer to this as N - G, where N and G are the NITERS and
+	 GAP for the original loop.  */
+      niters = LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo)
+	? LOOP_VINFO_NITERSM1 (loop_vinfo)
+	: LOOP_VINFO_NITERS (loop_vinfo);
+
+      /* Here we build a vector factorization mask:
+	 vf_mask = ~(VF - 1), where VF is the Vectorization Factor.  */
+      tree vf_mask = build_int_cst (TREE_TYPE (niters),
+				    LOOP_VINFO_VECT_FACTOR (loop_vinfo));
+      vf_mask = fold_build2 (MINUS_EXPR, TREE_TYPE (vf_mask),
+			     vf_mask,
+			     build_one_cst (TREE_TYPE (vf_mask)));
+      vf_mask = fold_build1 (BIT_NOT_EXPR, TREE_TYPE (niters), vf_mask);
+
+      /* Here we calculate:
+	 niters = N - ((N-G) & ~(VF -1)) */
+      niters = fold_build2 (MINUS_EXPR, TREE_TYPE (niters),
+			    LOOP_VINFO_NITERS (loop_vinfo),
+			    fold_build2 (BIT_AND_EXPR, TREE_TYPE (niters),
+					 niters,
+					 vf_mask));
+
+      if (skip_vector)
+	{
+	  /* If it is not guaranteed we enter the main loop we need to
+	     make the niters of the epilogue conditional on entering the main
+	     loop.  We do this by constructing:
+	     cond_niters = !do_we_enter_main_loop ? N + niters_prolog : niters
+	     we add niters_prolog, the number of peeled iterations in the
+	     prolog, to N in case we don't enter the main loop, as these have
+	     already been subtracted from N (the number of iterations of the
+	     main loop).  Since the prolog peeling is also skipped if we skip the
+	     main loop we must add those iterations back.  */
+	  cond_niters
+	    = fold_build3 (COND_EXPR, TREE_TYPE (niters),
+			   vector_guard,
+			   fold_build2 (PLUS_EXPR, TREE_TYPE (niters),
+					LOOP_VINFO_NITERS (loop_vinfo),
+					fold_convert (TREE_TYPE (niters),
+						      niters_prolog)),
+			   niters);
+	}
+      else
+	cond_niters = niters;
+
+      LOOP_VINFO_NITERS (epilogue_vinfo) = cond_niters;
+      LOOP_VINFO_NITERSM1 (epilogue_vinfo)
+	= fold_build2 (MINUS_EXPR, TREE_TYPE (cond_niters),
+		       cond_niters, build_one_cst (TREE_TYPE (cond_niters)));
+
+      /* We now calculate the amount of iterations we must advance our
+	 epilogue's data references by.  Make sure to use sizetype here as
+	 otherwise the pointer computation may go wrong on targets with
+	 different pointer sizes to the used niters type.  */
+      *advance = fold_convert (sizetype, niters);
+
+      *advance = fold_build2 (MINUS_EXPR, TREE_TYPE (*advance),
+			      *advance,
+			      fold_convert (sizetype,
+					    LOOP_VINFO_NITERS (loop_vinfo)));
+      *advance = fold_build2 (MINUS_EXPR, TREE_TYPE (*advance),
+			      build_zero_cst (TREE_TYPE (*advance)),
+			      *advance);
+
+      if (skip_vector)
+	{
+	  /* If we are skipping the vectorized loop then we must roll back the
+	     data references by the amount we might have expected to peel in
+	     the, also skipped, prolog.  */
+	  *advance
+	    = fold_build3 (COND_EXPR, TREE_TYPE (*advance),
+			   vector_guard,
+			   fold_build2 (MINUS_EXPR, TREE_TYPE (*advance),
+					build_zero_cst (TREE_TYPE (*advance)),
+					fold_convert (TREE_TYPE (*advance),
+						      niters_prolog)),
+			   *advance);
+	}
+
+      /* Redo the peeling for niter analysis as the NITERs and alignment
+	 may have been updated to take the main loop into account.  */
+      determine_peel_for_niter (epilogue_vinfo);
+    }
+
   adjust_vec.release ();
   free_original_copy_tables ();
 
-  return epilog;
+  if (vect_epilogues)
+    {
+      basic_block *bbs = get_loop_body (loop);
+      loop_vec_info epilogue_vinfo = loop_vec_info_for_loop (epilog);
+
+      LOOP_VINFO_UP_STMTS (epilogue_vinfo).create (0);
+      LOOP_VINFO_UP_GT_DRS (epilogue_vinfo).create (0);
+      LOOP_VINFO_UP_DRS (epilogue_vinfo).create (0);
+
+      gimple_stmt_iterator gsi;
+      gphi_iterator phi_gsi;
+      gimple *stmt;
+      stmt_vec_info stmt_vinfo;
+      dr_vec_info *dr_vinfo;
+
+      /* The stmt_vec_info's of the epilogue were constructed for the main loop
+	 and need to be updated to refer to the cloned variables used in the
+	 epilogue loop.  We do this by assuming the original main loop and the
+	 epilogue loop are identical (aside the different SSA names).  This
+	 means we assume we can go through each BB in the loop and each STMT in
+	 each BB and map them 1:1, replacing the STMT_VINFO_STMT of each
+	 stmt_vec_info in the epilogue's loop_vec_info.  Here we only keep
+	 track of the original state of the main loop, before vectorization.
+	 After vectorization we proceed to update the epilogue's stmt_vec_infos
+	 information.  We also update the references in PATTERN_DEF_SEQ's,
+	 RELATED_STMT's and data_references.  Mainly the latter has to be
+	 updated after we are done vectorizing the main loop, as the
+	 data_references are shared between main and epilogue.  */
+      for (unsigned i = 0; i < loop->num_nodes; ++i)
+	{
+	  for (phi_gsi = gsi_start_phis (bbs[i]);
+	       !gsi_end_p (phi_gsi); gsi_next (&phi_gsi))
+	    LOOP_VINFO_UP_STMTS (epilogue_vinfo).safe_push (phi_gsi.phi ());
+	  for (gsi = gsi_start_bb (bbs[i]);
+	       !gsi_end_p (gsi); gsi_next (&gsi))
+	    {
+	      stmt = gsi_stmt (gsi);
+	      LOOP_VINFO_UP_STMTS (epilogue_vinfo).safe_push (stmt);
+	      stmt_vinfo  = epilogue_vinfo->lookup_stmt (stmt);
+	      if (stmt_vinfo != NULL
+		  && stmt_vinfo->dr_aux.stmt == stmt_vinfo)
+		{
+		  dr_vinfo = STMT_VINFO_DR_INFO (stmt_vinfo);
+		  /* Data references pointing to gather loads and scatter stores
+		     require special treatment because the address computation
+		     happens in a different gimple node, pointed to by DR_REF.
+		     In contrast to normal loads and stores where we only need
+		     to update the offset of the data reference.  */
+		  if (STMT_VINFO_GATHER_SCATTER_P (dr_vinfo->stmt))
+		    LOOP_VINFO_UP_GT_DRS (epilogue_vinfo).safe_push (dr_vinfo);
+		  LOOP_VINFO_UP_DRS (epilogue_vinfo).safe_push (dr_vinfo);
+		}
+	    }
+	}
+    }
+
+  return vect_epilogues ? epilog : NULL;
 }
 
 /* Function vect_create_cond_for_niters_checks.
diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
index 72b80f46b1a9fa0bc8392809c286b5fac9a74451..81a5576a13004248d15db80145652d37f432c695 100644
--- a/gcc/tree-vect-loop.c
+++ b/gcc/tree-vect-loop.c
@@ -715,7 +715,7 @@ vect_fixup_scalar_cycles_with_patterns (loop_vec_info loop_vinfo)
    Return the loop exit condition.  */
 
 
-static gcond *
+gcond *
 vect_get_loop_niters (class loop *loop, tree *assumptions,
 		      tree *number_of_iterations, tree *number_of_iterationsm1)
 {
@@ -886,6 +886,9 @@ _loop_vec_info::_loop_vec_info (class loop *loop_in, vec_info_shared *shared)
 	    }
 	}
     }
+
+  epilogue_vinfos.create (6);
+  epilogue_vsizes.create (8);
 }
 
 /* Free all levels of MASKS.  */
@@ -910,6 +913,8 @@ _loop_vec_info::~_loop_vec_info ()
   release_vec_loop_masks (&masks);
   delete ivexpr_map;
   delete scan_map;
+  epilogue_vinfos.release ();
+  epilogue_vsizes.release ();
 
   loop->aux = NULL;
 }
@@ -1683,9 +1688,20 @@ vect_analyze_loop_costing (loop_vec_info loop_vinfo)
       return 0;
     }
 
-  HOST_WIDE_INT estimated_niter = estimated_stmt_executions_int (loop);
-  if (estimated_niter == -1)
-    estimated_niter = likely_max_stmt_executions_int (loop);
+  HOST_WIDE_INT estimated_niter;
+
+  /* If we are vectorizing an epilogue then we know the maximum number of
+     scalar iterations it will cover is at least one lower than the
+     vectorization factor of the main loop.  */
+  if (LOOP_VINFO_EPILOGUE_P (loop_vinfo))
+    estimated_niter
+      = vect_vf_for_cost (LOOP_VINFO_ORIG_LOOP_INFO (loop_vinfo)) - 1;
+  else
+    {
+      estimated_niter = estimated_stmt_executions_int (loop);
+      if (estimated_niter == -1)
+	estimated_niter = likely_max_stmt_executions_int (loop);
+    }
   if (estimated_niter != -1
       && ((unsigned HOST_WIDE_INT) estimated_niter
 	  < MAX (th, (unsigned) min_profitable_estimate)))
@@ -1872,6 +1888,15 @@ vect_analyze_loop_2 (loop_vec_info loop_vinfo, bool &fatal, unsigned *n_stmts)
   int res;
   unsigned int max_vf = MAX_VECTORIZATION_FACTOR;
   poly_uint64 min_vf = 2;
+  loop_vec_info orig_loop_vinfo = NULL;
+
+  /* If we are dealing with an epilogue then orig_loop_vinfo points to the
+     loop_vec_info of the first vectorized loop.  */
+  if (LOOP_VINFO_EPILOGUE_P (loop_vinfo))
+    orig_loop_vinfo = LOOP_VINFO_ORIG_LOOP_INFO (loop_vinfo);
+  else
+    orig_loop_vinfo = loop_vinfo;
+  gcc_assert (orig_loop_vinfo);
 
   /* The first group of checks is independent of the vector size.  */
   fatal = true;
@@ -2151,8 +2176,18 @@ start_over:
   /* During peeling, we need to check if number of loop iterations is
      enough for both peeled prolog loop and vector loop.  This check
      can be merged along with threshold check of loop versioning, so
-     increase threshold for this case if necessary.  */
-  if (LOOP_REQUIRES_VERSIONING (loop_vinfo))
+     increase threshold for this case if necessary.
+
+     If we are analyzing an epilogue we still want to check what its
+     versioning threshold would be.  If we decide to vectorize the epilogues we
+     will want to use the lowest versioning threshold of all epilogues and main
+     loop.  This will enable us to enter a vectorized epilogue even when
+     versioning the loop.  We can't simply check whether the epilogue requires
+     versioning though since we may have skipped some versioning checks when
+     analyzing the epilogue.  For instance, checks for alias versioning will be
+     skipped when dealing with epilogues as we assume we already checked them
+     for the main loop.  So instead we always check the 'orig_loop_vinfo'.  */
+  if (LOOP_REQUIRES_VERSIONING (orig_loop_vinfo))
     {
       poly_uint64 niters_th = 0;
       unsigned int th = LOOP_VINFO_COST_MODEL_THRESHOLD (loop_vinfo);
@@ -2307,14 +2342,8 @@ again:
    be vectorized.  */
 opt_loop_vec_info
 vect_analyze_loop (class loop *loop, loop_vec_info orig_loop_vinfo,
-		   vec_info_shared *shared)
+		   vec_info_shared *shared, vector_sizes vector_sizes)
 {
-  auto_vector_sizes vector_sizes;
-
-  /* Autodetect first vector size we try.  */
-  current_vector_size = 0;
-  targetm.vectorize.autovectorize_vector_sizes (&vector_sizes,
-						loop->simdlen != 0);
   unsigned int next_size = 0;
 
   DUMP_VECT_SCOPE ("analyze_loop_nest");
@@ -2335,6 +2364,9 @@ vect_analyze_loop (class loop *loop, loop_vec_info orig_loop_vinfo,
   poly_uint64 autodetected_vector_size = 0;
   opt_loop_vec_info first_loop_vinfo = opt_loop_vec_info::success (NULL);
   poly_uint64 first_vector_size = 0;
+  poly_uint64 lowest_th = 0;
+  unsigned vectorized_loops = 0;
+  bool vect_epilogues = !loop->simdlen && PARAM_VALUE (PARAM_VECT_EPILOGUES_NOMASK);
   while (1)
     {
       /* Check the CFG characteristics of the loop (nesting, entry/exit).  */
@@ -2353,24 +2385,52 @@ vect_analyze_loop (class loop *loop, loop_vec_info orig_loop_vinfo,
 
       if (orig_loop_vinfo)
 	LOOP_VINFO_ORIG_LOOP_INFO (loop_vinfo) = orig_loop_vinfo;
+      else if (vect_epilogues && first_loop_vinfo)
+	LOOP_VINFO_ORIG_LOOP_INFO (loop_vinfo) = first_loop_vinfo;
 
       opt_result res = vect_analyze_loop_2 (loop_vinfo, fatal, &n_stmts);
       if (res)
 	{
 	  LOOP_VINFO_VECTORIZABLE_P (loop_vinfo) = 1;
+	  vectorized_loops++;
 
-	  if (loop->simdlen
-	      && maybe_ne (LOOP_VINFO_VECT_FACTOR (loop_vinfo),
-			   (unsigned HOST_WIDE_INT) loop->simdlen))
+	  if ((loop->simdlen
+	       && maybe_ne (LOOP_VINFO_VECT_FACTOR (loop_vinfo),
+			    (unsigned HOST_WIDE_INT) loop->simdlen))
+	      || vect_epilogues)
 	    {
 	      if (first_loop_vinfo == NULL)
 		{
 		  first_loop_vinfo = loop_vinfo;
+		  lowest_th
+		    = LOOP_VINFO_VERSIONING_THRESHOLD (first_loop_vinfo);
 		  first_vector_size = current_vector_size;
 		  loop->aux = NULL;
 		}
 	      else
-		delete loop_vinfo;
+		{
+		  /* Keep track of vector sizes that we know we can vectorize
+		     the epilogue with.  */
+		  if (vect_epilogues)
+		    {
+		      loop->aux = NULL;
+		      first_loop_vinfo->epilogue_vsizes.reserve (1);
+		      first_loop_vinfo->epilogue_vsizes.quick_push (current_vector_size);
+		      first_loop_vinfo->epilogue_vinfos.reserve (1);
+		      first_loop_vinfo->epilogue_vinfos.quick_push (loop_vinfo);
+		      LOOP_VINFO_ORIG_LOOP_INFO (loop_vinfo) = first_loop_vinfo;
+		      poly_uint64 th
+			= LOOP_VINFO_VERSIONING_THRESHOLD (loop_vinfo);
+		      gcc_assert (!LOOP_REQUIRES_VERSIONING (loop_vinfo)
+				  || maybe_ne (lowest_th, 0U));
+		      /* Keep track of the known smallest versioning
+			 threshold.  */
+		      if (ordered_p (lowest_th, th))
+			lowest_th = ordered_min (lowest_th, th);
+		    }
+		  else
+		    delete loop_vinfo;
+		}
 	    }
 	  else
 	    {
@@ -2408,6 +2468,8 @@ vect_analyze_loop (class loop *loop, loop_vec_info orig_loop_vinfo,
 		  dump_dec (MSG_NOTE, current_vector_size);
 		  dump_printf (MSG_NOTE, "\n");
 		}
+	      LOOP_VINFO_VERSIONING_THRESHOLD (first_loop_vinfo) = lowest_th;
+
 	      return first_loop_vinfo;
 	    }
 	  else
@@ -8128,6 +8190,188 @@ vect_transform_loop_stmt (loop_vec_info loop_vinfo, stmt_vec_info stmt_info,
     *seen_store = stmt_info;
 }
 
+/* Helper function to pass to simplify_replace_tree to enable replacing trees
+   in the hash_map with their corresponding values.  */
+static tree
+find_in_mapping (tree t, void *context)
+{
+  hash_map<tree,tree>* mapping = (hash_map<tree, tree>*) context;
+
+  tree *value = mapping->get (t);
+  return value ? *value : t;
+}
+
+static void
+update_epilogue_loop_vinfo (class loop *epilogue, tree advance)
+{
+  loop_vec_info epilogue_vinfo = loop_vec_info_for_loop (epilogue);
+  auto_vec<stmt_vec_info> pattern_worklist, related_worklist;
+  hash_map<tree,tree> mapping;
+  gimple *orig_stmt, *new_stmt;
+  gimple_stmt_iterator epilogue_gsi;
+  gphi_iterator epilogue_phi_gsi;
+  stmt_vec_info stmt_vinfo = NULL, related_vinfo;
+  basic_block *epilogue_bbs = get_loop_body (epilogue);
+
+  LOOP_VINFO_BBS (epilogue_vinfo) = epilogue_bbs;
+
+  vect_update_inits_of_drs (epilogue_vinfo, advance, PLUS_EXPR);
+
+
+  /* We are done vectorizing the main loop, so now we update the epilogue's
+     stmt_vec_info's.  At the same time we set the gimple UID of each
+     statement in the epilogue, as these are used to look them up in the
+     epilogue's loop_vec_info later.  We also keep track of which
+     stmt_vec_info's have PATTERN_DEF_SEQ's and RELATED_STMT's that might
+     need updating, and we construct a mapping between variables defined in
+     the main loop and their corresponding names in the epilogue.  */
+  for (unsigned i = 0; i < epilogue->num_nodes; ++i)
+    {
+      for (epilogue_phi_gsi = gsi_start_phis (epilogue_bbs[i]);
+	   !gsi_end_p (epilogue_phi_gsi); gsi_next (&epilogue_phi_gsi))
+	{
+	  orig_stmt = LOOP_VINFO_UP_STMTS (epilogue_vinfo)[0];
+	  LOOP_VINFO_UP_STMTS (epilogue_vinfo).ordered_remove (0);
+	  new_stmt = epilogue_phi_gsi.phi ();
+
+	  stmt_vinfo
+	    = epilogue_vinfo->lookup_stmt (orig_stmt);
+
+	  STMT_VINFO_STMT (stmt_vinfo) = new_stmt;
+	  gimple_set_uid (new_stmt, gimple_uid (orig_stmt));
+
+	  mapping.put (gimple_phi_result (orig_stmt),
+			gimple_phi_result (new_stmt));
+
+	  if (STMT_VINFO_PATTERN_DEF_SEQ (stmt_vinfo))
+	    pattern_worklist.safe_push (stmt_vinfo);
+
+	  related_vinfo = STMT_VINFO_RELATED_STMT (stmt_vinfo);
+	  while (related_vinfo && related_vinfo != stmt_vinfo)
+	    {
+	      related_worklist.safe_push (related_vinfo);
+	      /* Set BB such that the assert in
+		'get_initial_def_for_reduction' is able to determine that
+		the BB of the related stmt is inside this loop.  */
+	      gimple_set_bb (STMT_VINFO_STMT (related_vinfo),
+			     gimple_bb (new_stmt));
+	      related_vinfo = STMT_VINFO_RELATED_STMT (related_vinfo);
+	    }
+	}
+
+      for (epilogue_gsi = gsi_start_bb (epilogue_bbs[i]);
+	   !gsi_end_p (epilogue_gsi); gsi_next (&epilogue_gsi))
+	{
+	  orig_stmt = LOOP_VINFO_UP_STMTS (epilogue_vinfo)[0];
+	  LOOP_VINFO_UP_STMTS (epilogue_vinfo).ordered_remove (0);
+	  new_stmt = gsi_stmt (epilogue_gsi);
+
+	  stmt_vinfo
+	    = epilogue_vinfo->lookup_stmt (orig_stmt);
+
+	  STMT_VINFO_STMT (stmt_vinfo) = new_stmt;
+	  gimple_set_uid (new_stmt, gimple_uid (orig_stmt));
+
+	  if (is_gimple_assign (orig_stmt))
+	    {
+	      gcc_assert (is_gimple_assign (new_stmt));
+	      mapping.put (gimple_assign_lhs (orig_stmt),
+			  gimple_assign_lhs (new_stmt));
+	    }
+
+	  if (STMT_VINFO_PATTERN_DEF_SEQ (stmt_vinfo))
+	    pattern_worklist.safe_push (stmt_vinfo);
+
+	  related_vinfo = STMT_VINFO_RELATED_STMT (stmt_vinfo);
+	  related_vinfo = STMT_VINFO_RELATED_STMT (stmt_vinfo);
+	  while (related_vinfo && related_vinfo != stmt_vinfo)
+	    {
+	      related_worklist.safe_push (related_vinfo);
+	      /* Set BB such that the assert in
+		'get_initial_def_for_reduction' is able to determine that
+		the BB of the related stmt is inside this loop.  */
+	      gimple_set_bb (STMT_VINFO_STMT (related_vinfo),
+			     gimple_bb (new_stmt));
+	      related_vinfo = STMT_VINFO_RELATED_STMT (related_vinfo);
+	    }
+	}
+      gcc_assert (LOOP_VINFO_UP_STMTS (epilogue_vinfo).length () == 0);
+    }
+
+  /* The PATTERN_DEF_SEQ's in the epilogue were constructed using the
+     original main loop and thus need to be updated to refer to the cloned
+     variables used in the epilogue.  */
+  for (unsigned i = 0; i < pattern_worklist.length (); ++i)
+    {
+      gimple_seq seq = STMT_VINFO_PATTERN_DEF_SEQ (pattern_worklist[i]);
+      tree *new_op;
+
+      while (seq)
+	{
+	  for (unsigned j = 1; j < gimple_num_ops (seq); ++j)
+	    {
+	      tree op = gimple_op (seq, j);
+	      if ((new_op = mapping.get (op)))
+		gimple_set_op (seq, j, *new_op);
+	      else
+		{
+		  op = simplify_replace_tree (op, NULL_TREE, NULL_TREE,
+					 &find_in_mapping, &mapping);
+		  gimple_set_op (seq, j, op);
+		}
+	    }
+	  seq = seq->next;
+	}
+    }
+
+  /* Just like the PATTERN_DEF_SEQ's the RELATED_STMT's also need to be
+     updated.  */
+  for (unsigned i = 0; i < related_worklist.length (); ++i)
+    {
+      tree *new_t;
+      gimple *stmt = STMT_VINFO_STMT (related_worklist[i]);
+      for (unsigned j = 1; j < gimple_num_ops (stmt); ++j)
+	if ((new_t = mapping.get (gimple_op (stmt, j))))
+	  gimple_set_op (stmt, j, *new_t);
+    }
+
+  tree *new_op;
+  /* Data references for gather loads and scatter stores do not use the
+     updated offset we set using ADVANCE.  Instead we have to make sure the
+     reference in each data reference points to the corresponding copy of
+     the original in the epilogue.  */
+  for (unsigned i = 0; i < LOOP_VINFO_UP_GT_DRS (epilogue_vinfo).length (); ++i)
+    {
+      dr_vec_info *dr_vinfo = LOOP_VINFO_UP_GT_DRS (epilogue_vinfo)[i];
+      data_reference *dr = dr_vinfo->dr;
+      gcc_assert (dr);
+      gcc_assert (TREE_CODE (DR_REF (dr)) == MEM_REF);
+      new_op = mapping.get (TREE_OPERAND (DR_REF (dr), 0));
+
+      if (new_op)
+	{
+	  DR_REF (dr) = unshare_expr (DR_REF (dr));
+	  TREE_OPERAND (DR_REF (dr), 0) = *new_op;
+	  DR_STMT (dr_vinfo->dr) = SSA_NAME_DEF_STMT (*new_op);
+	}
+    }
+
+  /* The vector size of the epilogue is smaller than that of the main loop
+     so the alignment is either the same or lower.  This means the dr will
+     by definition be aligned.  */
+  for (unsigned i = 0; i < LOOP_VINFO_UP_DRS (epilogue_vinfo).length (); ++i)
+    LOOP_VINFO_UP_DRS (epilogue_vinfo)[i]->base_misaligned = false;
+
+
+  LOOP_VINFO_UP_STMTS (epilogue_vinfo).release ();
+  LOOP_VINFO_UP_GT_DRS (epilogue_vinfo).release ();
+  LOOP_VINFO_UP_DRS (epilogue_vinfo).release ();
+
+  epilogue_vinfo->shared->datarefs_copy.release ();
+  epilogue_vinfo->shared->save_datarefs ();
+}
+
+
 /* Function vect_transform_loop.
 
    The analysis phase has determined that the loop is vectorizable.
@@ -8165,11 +8409,11 @@ vect_transform_loop (loop_vec_info loop_vinfo)
   if (th >= vect_vf_for_cost (loop_vinfo)
       && !LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo))
     {
-      if (dump_enabled_p ())
-	dump_printf_loc (MSG_NOTE, vect_location,
-			 "Profitability threshold is %d loop iterations.\n",
-                         th);
-      check_profitability = true;
+	if (dump_enabled_p ())
+	  dump_printf_loc (MSG_NOTE, vect_location,
+			   "Profitability threshold is %d loop iterations.\n",
+			   th);
+	check_profitability = true;
     }
 
   /* Make sure there exists a single-predecessor exit bb.  Do this before 
@@ -8213,9 +8457,13 @@ vect_transform_loop (loop_vec_info loop_vinfo)
   LOOP_VINFO_NITERS_UNCHANGED (loop_vinfo) = niters;
   tree nitersm1 = unshare_expr (LOOP_VINFO_NITERSM1 (loop_vinfo));
   bool niters_no_overflow = loop_niters_no_overflow (loop_vinfo);
+  tree advance;
+
   epilogue = vect_do_peeling (loop_vinfo, niters, nitersm1, &niters_vector,
 			      &step_vector, &niters_vector_mult_vf, th,
-			      check_profitability, niters_no_overflow);
+			      check_profitability, niters_no_overflow,
+			      &advance);
+
   if (LOOP_VINFO_SCALAR_LOOP (loop_vinfo)
       && LOOP_VINFO_SCALAR_LOOP_SCALING (loop_vinfo).initialized_p ())
     scale_loop_frequencies (LOOP_VINFO_SCALAR_LOOP (loop_vinfo),
@@ -8474,57 +8722,14 @@ vect_transform_loop (loop_vec_info loop_vinfo)
      since vectorized loop can have loop-carried dependencies.  */
   loop->safelen = 0;
 
-  /* Don't vectorize epilogue for epilogue.  */
-  if (LOOP_VINFO_EPILOGUE_P (loop_vinfo))
-    epilogue = NULL;
-
-  if (!PARAM_VALUE (PARAM_VECT_EPILOGUES_NOMASK))
-    epilogue = NULL;
-
   if (epilogue)
     {
-      auto_vector_sizes vector_sizes;
-      targetm.vectorize.autovectorize_vector_sizes (&vector_sizes, false);
-      unsigned int next_size = 0;
-
-      /* Note LOOP_VINFO_NITERS_KNOWN_P and LOOP_VINFO_INT_NITERS work
-         on niters already ajusted for the iterations of the prologue.  */
-      if (LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
-	  && known_eq (vf, lowest_vf))
-	{
-	  unsigned HOST_WIDE_INT eiters
-	    = (LOOP_VINFO_INT_NITERS (loop_vinfo)
-	       - LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo));
-	  eiters
-	    = eiters % lowest_vf + LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo);
-	  epilogue->nb_iterations_upper_bound = eiters - 1;
-	  epilogue->any_upper_bound = true;
-
-	  unsigned int ratio;
-	  while (next_size < vector_sizes.length ()
-		 && !(constant_multiple_p (current_vector_size,
-					   vector_sizes[next_size], &ratio)
-		      && eiters >= lowest_vf / ratio))
-	    next_size += 1;
-	}
-      else
-	while (next_size < vector_sizes.length ()
-	       && maybe_lt (current_vector_size, vector_sizes[next_size]))
-	  next_size += 1;
+      update_epilogue_loop_vinfo (epilogue, advance);
 
-      if (next_size == vector_sizes.length ())
-	epilogue = NULL;
-    }
-
-  if (epilogue)
-    {
+      epilogue->simduid = loop->simduid;
       epilogue->force_vectorize = loop->force_vectorize;
       epilogue->safelen = loop->safelen;
       epilogue->dont_vectorize = false;
-
-      /* We may need to if-convert epilogue to vectorize it.  */
-      if (LOOP_VINFO_SCALAR_LOOP (loop_vinfo))
-	tree_if_conversion (epilogue);
     }
 
   return epilogue;
diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
index bdb6b87c7b2d61302c33b071f737ecea41c06d33..fecd22f14bf03edc39ef325d3d80bf258b99603d 100644
--- a/gcc/tree-vectorizer.h
+++ b/gcc/tree-vectorizer.h
@@ -561,8 +561,26 @@ public:
      this points to the original vectorized loop.  Otherwise NULL.  */
   _loop_vec_info *orig_loop_info;
 
+  vec<_loop_vec_info *> epilogue_vinfos;
+
+  /* Keep track of vector sizes we know we can vectorize the epilogue with.
+     Only the first vectorized loop keeps track of these, for all its possible
+     epilogues.  */
+  vector_sizes epilogue_vsizes;
+
+  struct
+  {
+    vec<gimple *, va_heap, vl_ptr> orig_stmts;
+    vec<dr_vec_info *,va_heap, vl_ptr> gather_scatter_drs;
+    vec<dr_vec_info *,va_heap, vl_ptr> drs;
+  } update_epilogue_vinfo;
+
 } *loop_vec_info;
 
+#define LOOP_VINFO_UP_STMTS(L)	(L)->update_epilogue_vinfo.orig_stmts
+#define LOOP_VINFO_UP_GT_DRS(L)	(L)->update_epilogue_vinfo.gather_scatter_drs
+#define LOOP_VINFO_UP_DRS(L)	(L)->update_epilogue_vinfo.drs
+
 /* Access Functions.  */
 #define LOOP_VINFO_LOOP(L)                 (L)->loop
 #define LOOP_VINFO_BBS(L)                  (L)->bbs
@@ -613,6 +631,7 @@ public:
 #define LOOP_VINFO_SINGLE_SCALAR_ITERATION_COST(L) (L)->single_scalar_iteration_cost
 #define LOOP_VINFO_ORIG_LOOP_INFO(L)       (L)->orig_loop_info
 #define LOOP_VINFO_SIMD_IF_COND(L)         (L)->simd_if_cond
+#define LOOP_VINFO_EPILOGUE_SIZES(L)	   (L)->epilogue_vsizes
 
 #define LOOP_REQUIRES_VERSIONING_FOR_ALIGNMENT(L)	\
   ((L)->may_misalign_stmts.length () > 0)
@@ -1516,10 +1535,14 @@ class loop *slpeel_tree_duplicate_loop_to_edge_cfg (class loop *,
 						     class loop *, edge);
 class loop *vect_loop_versioning (loop_vec_info);
 extern class loop *vect_do_peeling (loop_vec_info, tree, tree,
-				     tree *, tree *, tree *, int, bool, bool);
+				    tree *, tree *, tree *, int, bool, bool,
+				    tree *);
 extern void vect_prepare_for_masked_peels (loop_vec_info);
 extern dump_user_location_t find_loop_location (class loop *);
 extern bool vect_can_advance_ivs_p (loop_vec_info);
+extern gcond *vect_get_loop_niters (class loop *, tree *, tree *, tree *);
+extern void vect_update_inits_of_drs (loop_vec_info, tree, tree_code);
+
 
 /* In tree-vect-stmts.c.  */
 extern poly_uint64 current_vector_size;
@@ -1627,6 +1650,8 @@ extern tree vect_create_addr_base_for_vector_ref (stmt_vec_info, gimple_seq *,
 						  tree, tree = NULL_TREE);
 
 /* In tree-vect-loop.c.  */
+/* Used in tree-vect-loop-manip.c.  */
+extern void determine_peel_for_niter (loop_vec_info);
 extern widest_int vect_iv_limit_for_full_masking (loop_vec_info loop_vinfo);
 /* Used in gimple-loop-interchange.c and tree-parloops.c.  */
 extern bool check_reduction_path (dump_user_location_t, loop_p, gphi *, tree,
@@ -1634,7 +1659,8 @@ extern bool check_reduction_path (dump_user_location_t, loop_p, gphi *, tree,
 /* Drive for loop analysis stage.  */
 extern opt_loop_vec_info vect_analyze_loop (class loop *,
 					    loop_vec_info,
-					    vec_info_shared *);
+					    vec_info_shared *,
+					    vector_sizes);
 extern tree vect_build_loop_niters (loop_vec_info, bool * = NULL);
 extern void vect_gen_vector_loop_niters (loop_vec_info, tree, tree *,
 					 tree *, bool);
diff --git a/gcc/tree-vectorizer.c b/gcc/tree-vectorizer.c
index 3e8637f070d5cd526d6626d2b7ba1c5f9243ce0a..4dbb03cdcc8e9612083136d3ef9b5b16d6e30b13 100644
--- a/gcc/tree-vectorizer.c
+++ b/gcc/tree-vectorizer.c
@@ -875,6 +875,10 @@ try_vectorize_loop_1 (hash_table<simduid_to_vf> *&simduid_to_vf_htab,
   vec_info_shared shared;
   auto_purge_vect_location sentinel;
   vect_location = find_loop_location (loop);
+  auto_vector_sizes auto_vector_sizes;
+  vector_sizes vector_sizes;
+  bool assert_versioning = false;
+
   if (LOCATION_LOCUS (vect_location.get_location_t ()) != UNKNOWN_LOCATION
       && dump_enabled_p ())
     dump_printf (MSG_NOTE | MSG_PRIORITY_INTERNALS,
@@ -882,10 +886,35 @@ try_vectorize_loop_1 (hash_table<simduid_to_vf> *&simduid_to_vf_htab,
 		 LOCATION_FILE (vect_location.get_location_t ()),
 		 LOCATION_LINE (vect_location.get_location_t ()));
 
+  /* If this is an epilogue, we already know what vector sizes we will use for
+     vectorization as the analysis was part of the main vectorized loop.  Use
+     these instead of going through all vector sizes again.  */
+  if (orig_loop_vinfo
+      && !LOOP_VINFO_EPILOGUE_SIZES (orig_loop_vinfo).is_empty ())
+    {
+      vector_sizes = LOOP_VINFO_EPILOGUE_SIZES (orig_loop_vinfo);
+      assert_versioning = LOOP_REQUIRES_VERSIONING (orig_loop_vinfo);
+      current_vector_size = vector_sizes[0];
+    }
+  else
+    {
+      /* Autodetect first vector size we try.  */
+      current_vector_size = 0;
+
+      targetm.vectorize.autovectorize_vector_sizes (&auto_vector_sizes,
+						    loop->simdlen != 0);
+      vector_sizes = auto_vector_sizes;
+    }
+
   /* Try to analyze the loop, retaining an opt_problem if dump_enabled_p.  */
-  opt_loop_vec_info loop_vinfo
-    = vect_analyze_loop (loop, orig_loop_vinfo, &shared);
-  loop->aux = loop_vinfo;
+  opt_loop_vec_info loop_vinfo = opt_loop_vec_info::success (NULL);
+  if (loop_vec_info_for_loop (loop))
+    loop_vinfo = opt_loop_vec_info::success (loop_vec_info_for_loop (loop));
+  else
+    {
+      loop_vinfo = vect_analyze_loop (loop, orig_loop_vinfo, &shared, vector_sizes);
+      loop->aux = loop_vinfo;
+    }
 
   if (!loop_vinfo)
     if (dump_enabled_p ())
@@ -898,6 +927,10 @@ try_vectorize_loop_1 (hash_table<simduid_to_vf> *&simduid_to_vf_htab,
 
   if (!loop_vinfo || !LOOP_VINFO_VECTORIZABLE_P (loop_vinfo))
     {
+      /* If this loop requires versioning, make sure the analysis done on the
+	 epilogue loops succeeds.  */
+      gcc_assert (!assert_versioning);
+
       /* Free existing information if loop is analyzed with some
 	 assumptions.  */
       if (loop_constraint_set_p (loop, LOOP_C_FINITE))
@@ -1013,8 +1046,13 @@ try_vectorize_loop_1 (hash_table<simduid_to_vf> *&simduid_to_vf_htab,
 
   /* Epilogue of vectorized loop must be vectorized too.  */
   if (new_loop)
-    ret |= try_vectorize_loop_1 (simduid_to_vf_htab, num_vectorized_loops,
-				 new_loop, loop_vinfo, NULL, NULL);
+    {
+      /* Don't include vectorized epilogues in the "vectorized loops"
+	 count.  */
+      unsigned dont_count = *num_vectorized_loops;
+      ret |= try_vectorize_loop_1 (simduid_to_vf_htab, &dont_count,
+				   new_loop, loop_vinfo, NULL, NULL);
+    }
 
   return ret;
 }
