Re: [committed][vect]PR 88915: Vectorize epilogues when versioning loops (was Re: [PATCH 1/2][vect]PR 88915: Vectorize epilogues when versioning loops)

2019-10-29 Thread Richard Biener
On Tue, 29 Oct 2019, Andre Vieira (lists) wrote:

> Hi richi,
> 
> 
> Committed the patch with the suggested change in revision r277569.
> 
> I will look at the follow-up, from what I understood the idea was to add a
> field to the dr_vec_info where we keep the "extra-offset" that represents the
> position in the DR wrt the current loop/epilogue. So when we advance the DR,
> we change that field and not the DR_OFFSET? That way we always keep the
> "original state".

Yes.  That would apply to the adjustment done by prologue peeling as
well.  Still wonder how gather/scatter get by not having any adjustment ;)
Ah, I guess they simply are not affected by peeling since they get
their DR_OFFSET & friends computed in the loop (loaded from memory).
But DR_REF needs adjustment because it tries to walk def-stmts
(repeatedly in both analysis and transform phase).

Richard.

> Cheers,
> Andre
> 
> gcc/ChangeLog:
> 2019-10-29  Andre Vieira  
> 
> PR 88915
> * tree-ssa-loop-niter.h (simplify_replace_tree): Change declaration.
> * tree-ssa-loop-niter.c (simplify_replace_tree): Add context parameter
> and make the valueize function pointer also take a void pointer.
> * gcc/tree-ssa-sccvn.c (vn_valueize_wrapper): New function to wrap
> around vn_valueize, to call it without a context.
> (process_bb): Use vn_valueize_wrapper instead of vn_valueize.
> * tree-vect-loop.c (_loop_vec_info): Initialize epilogue_vinfos.
> (~_loop_vec_info): Release epilogue_vinfos.
> (vect_analyze_loop_costing): Use knowledge of main VF to estimate
> number of iterations of epilogue.
> (vect_analyze_loop_2): Adapt to analyse main loop for all supported
> vector sizes when vect-epilogues-nomask=1.  Also keep track of lowest
> versioning threshold needed for main loop.
> (vect_analyze_loop): Likewise.
> (find_in_mapping): New helper function.
> (update_epilogue_loop_vinfo): New function.
> (vect_transform_loop): When vectorizing epilogues re-use analysis done
> on main loop and call update_epilogue_loop_vinfo to update it.
> * tree-vect-loop-manip.c (vect_update_inits_of_drs): No longer insert
> stmts on loop preheader edge.
> (vect_do_peeling): Enable skip-vectors when doing loop versioning if
> we decided to vectorize epilogues.  Update epilogues NITERS and
> construct ADVANCE to update epilogues data references where needed.
> * tree-vectorizer.h (_loop_vec_info): Add epilogue_vinfos.
> (vect_do_peeling, vect_update_inits_of_drs,
>  determine_peel_for_niter, vect_analyze_loop): Add or update declarations.
> * tree-vectorizer.c (try_vectorize_loop_1): Make sure to use already
> created loop_vec_info's for epilogues when available.  Otherwise 
> analyse
> epilogue separately.
> 
> 
> 
> Cheers,
> Andre
> 
> On 29/10/2019 11:48, Richard Biener wrote:
> > On Mon, 28 Oct 2019, Andre Vieira (lists) wrote:
> > 
> >> Hi,
> >>
> >> Reworked according to your comments, see inline for clarification.
> >>
> >> Is this OK for trunk?
> > 
> > + gimple_seq seq = STMT_VINFO_PATTERN_DEF_SEQ (stmt_vinfo);
> > + while (seq)
> > +   {
> > + stmt_worklist.safe_push (seq);
> > + seq = seq->next;
> > +   }
> > 
> > you're supposed to do to the following, not access the ->next
> > implementation detail:
> > 
> >  for (gimple_stmt_iterator gsi = gsi_start (seq); !gsi_end_p (gsi);
> > gsi_next ())
> >stmt_worklist.safe_push (gsi_stmt (gsi));
> > 
> > 
> > +  /* Data references for gather loads and scatter stores do not use
> > the
> > +updated offset we set using ADVANCE.  Instead we have to make
> > sure
> > the
> > +reference in the data references point to the corresponding copy
> > of
> > +the original in the epilogue.  */
> > +  if (STMT_VINFO_GATHER_SCATTER_P (stmt_vinfo))
> > +   {
> > + DR_REF (dr)
> > +   = simplify_replace_tree (DR_REF (dr), NULL_TREE, NULL_TREE,
> > +_in_mapping, );
> > + DR_BASE_ADDRESS (dr)
> > +   = simplify_replace_tree (DR_BASE_ADDRESS (dr), NULL_TREE,
> > NULL_TREE,
> > +_in_mapping, );
> > +   }
> > 
> > Hmm.  So for other DRs we account for the previous vector loop
> > by adjusting DR_OFFSET?  But STMT_VINFO_GATHER_SCATTER_P ends up
> > using (unconditionally) DR_REF here?  In that case it seems
> > best to adjust DR_REF only but NULL out DR_BASE_ADDRESS and
> > DR_OFFSET?  I wonder how prologue peeling deals with
> > STMT_VINFO_GATHER_SCATTER_P ... I see the caller of
> > vect_update_init_of_dr there does nothing for STMT_VINFO_GATHER_SCATTER_P.
> > 
> > I wonder if (as followup to not delay this further) we can
> > "offload" all the DR adjustment by storing ADVANCE in dr_vec_info
> > and accounting for that when we create the dataref pointers in
> > vectorizable_load/store?  That way 

[committed][vect]PR 88915: Vectorize epilogues when versioning loops (was Re: [PATCH 1/2][vect]PR 88915: Vectorize epilogues when versioning loops)

2019-10-29 Thread Andre Vieira (lists)
if (TREE_CODE (DR_REF (dr)) == ARRAY_REF)
+   j = 1;
+ else
+   gcc_unreachable ();
+
+ if (tree *new_op = mapping.get (TREE_OPERAND (DR_REF (dr), j)))
+   {
+ DR_REF (dr) = unshare_expr (DR_REF (dr));
+ TREE_OPERAND (DR_REF (dr), j) = *new_op;
+   }

huh, do you really only ever see MEM_REF or ARRAY_REF here?
I would guess using simplify_replace_tree is safer.
There's also DR_BASE_ADDRESS - we seem to leave the DRs partially
updated, is that correct?


Yeah can use simplify_replace_tree indeed.  And I have changed it so it
updates DR_BASE_ADDRESS.  I think DR_BASE_ADDRESS never actually changed in
the way we use data_references... Either way, replacing them if they do change
is cleaner and more future proof.


Otherwise looks OK to me.

Thanks,
Richard.



gcc/ChangeLog:
2019-10-25  Andre Vieira  

  PR 88915
  * tree-ssa-loop-niter.h (simplify_replace_tree): Change declaration.
  * tree-ssa-loop-niter.c (simplify_replace_tree): Add context parameter
  and make the valueize function pointer also take a void pointer.
  * gcc/tree-ssa-sccvn.c (vn_valueize_wrapper): New function to wrap
  around vn_valueize, to call it without a context.
  (process_bb): Use vn_valueize_wrapper instead of vn_valueize.
  * tree-vect-loop.c (_loop_vec_info): Initialize epilogue_vinfos.
  (~_loop_vec_info): Release epilogue_vinfos.
  (vect_analyze_loop_costing): Use knowledge of main VF to estimate
  number of iterations of epilogue.
  (vect_analyze_loop_2): Adapt to analyse main loop for all supported
  vector sizes when vect-epilogues-nomask=1.  Also keep track of lowest
  versioning threshold needed for main loop.
  (vect_analyze_loop): Likewise.
  (find_in_mapping): New helper function.
  (update_epilogue_loop_vinfo): New function.
  (vect_transform_loop): When vectorizing epilogues re-use analysis done
  on main loop and call update_epilogue_loop_vinfo to update it.
  * tree-vect-loop-manip.c (vect_update_inits_of_drs): No longer insert
  stmts on loop preheader edge.
  (vect_do_peeling): Enable skip-vectors when doing loop versioning if
  we decided to vectorize epilogues.  Update epilogues NITERS and
  construct ADVANCE to update epilogues data references where needed.
  * tree-vectorizer.h (_loop_vec_info): Add epilogue_vinfos.
  (vect_do_peeling, vect_update_inits_of_drs,
  determine_peel_for_niter, vect_analyze_loop): Add or update
  declarations.
  * tree-vectorizer.c (try_vectorize_loop_1): Make sure to use already
  created loop_vec_info's for epilogues when available.  Otherwise
analyse
  epilogue separately.



Cheers,
Andre








>From 3aba91fb4d1a0dcfa04b8c8ead1941679cd46ae3 Mon Sep 17 00:00:00 2001
From: Andre Simoes Dias Vieira 
Date: Fri, 18 Oct 2019 17:49:58 +0100
Subject: [PATCH 2/3] [PATCH 1/2][vect]PR 88915: Vectorize epilogues when
 versioning loops

---
 gcc/tree-ssa-loop-niter.c  |   6 +-
 gcc/tree-ssa-loop-niter.h  |   4 +-
 gcc/tree-ssa-sccvn.c   |   6 +-
 gcc/tree-vect-loop-manip.c | 203 ---
 gcc/tree-vect-loop.c   | 332 ++---
 gcc/tree-vectorizer.c  |  25 ++-
 gcc/tree-vectorizer.h  |  13 +-
 7 files changed, 490 insertions(+), 99 deletions(-)

diff --git a/gcc/tree-ssa-loop-niter.c b/gcc/tree-ssa-loop-niter.c
index cd2ced369719c37afd4aac08ff360719d7702e42..db666f019808850ed3a4aeef1a454a7ae2c65ef2 100644
--- a/gcc/tree-ssa-loop-niter.c
+++ b/gcc/tree-ssa-loop-niter.c
@@ -1935,7 +1935,7 @@ number_of_iterations_cond (class loop *loop,
 
 tree
 simplify_replace_tree (tree expr, tree old, tree new_tree,
-		   tree (*valueize) (tree))
+		   tree (*valueize) (tree, void*), void *context)
 {
   unsigned i, n;
   tree ret = NULL_TREE, e, se;
@@ -1951,7 +1951,7 @@ simplify_replace_tree (tree expr, tree old, tree new_tree,
 {
   if (TREE_CODE (expr) == SSA_NAME)
 	{
-	  new_tree = valueize (expr);
+	  new_tree = valueize (expr, context);
 	  if (new_tree != expr)
 	return new_tree;
 	}
@@ -1967,7 +1967,7 @@ simplify_replace_tree (tree expr, tree old, tree new_tree,
   for (i = 0; i < n; i++)
 {
   e = TREE_OPERAND (expr, i);
-  se = simplify_replace_tree (e, old, new_tree, valueize);
+  se = simplify_replace_tree (e, old, new_tree, valueize, context);
   if (e == se)
 	continue;
 
diff --git a/gcc/tree-ssa-loop-niter.h b/gcc/tree-ssa-loop-niter.h
index 4454c1ac78e02228047511a9e0214c82946855b8..aec6225125ce42ab0e4dbc930fc1a93862e6e267 100644
--- a/gcc/tree-ssa-loop-niter.h
+++ b/gcc/tree-ssa-loop-niter.h
@@ -53,7 +53,9 @@ extern bool scev_probably_wraps_p (tree, tree, tree, gimple *,
    class loop *, bool);
 extern void free_numbers_of_iterations_estimates (class loop *);
 extern void free_numbers_of_iterations_estimates (function *);
-extern tree simplify_replace_tree (tree, tree, tree, tree (*)(tree) = NULL);
+extern tr

Re: [PATCH 1/2][vect]PR 88915: Vectorize epilogues when versioning loops

2019-10-29 Thread Richard Biener
On Mon, 28 Oct 2019, Andre Vieira (lists) wrote:

> Hi,
> 
> Reworked according to your comments, see inline for clarification.
> 
> Is this OK for trunk?

+ gimple_seq seq = STMT_VINFO_PATTERN_DEF_SEQ (stmt_vinfo);
+ while (seq)
+   {
+ stmt_worklist.safe_push (seq);
+ seq = seq->next;
+   }

you're supposed to do to the following, not access the ->next
implementation detail:

for (gimple_stmt_iterator gsi = gsi_start (seq); !gsi_end_p (gsi); 
gsi_next ())
  stmt_worklist.safe_push (gsi_stmt (gsi));


+  /* Data references for gather loads and scatter stores do not use 
the
+updated offset we set using ADVANCE.  Instead we have to make 
sure
the
+reference in the data references point to the corresponding copy 
of
+the original in the epilogue.  */
+  if (STMT_VINFO_GATHER_SCATTER_P (stmt_vinfo))
+   {
+ DR_REF (dr)
+   = simplify_replace_tree (DR_REF (dr), NULL_TREE, NULL_TREE,
+_in_mapping, );
+ DR_BASE_ADDRESS (dr)
+   = simplify_replace_tree (DR_BASE_ADDRESS (dr), NULL_TREE,
NULL_TREE,
+_in_mapping, );
+   }

Hmm.  So for other DRs we account for the previous vector loop
by adjusting DR_OFFSET?  But STMT_VINFO_GATHER_SCATTER_P ends up
using (unconditionally) DR_REF here?  In that case it seems
best to adjust DR_REF only but NULL out DR_BASE_ADDRESS and
DR_OFFSET?  I wonder how prologue peeling deals with
STMT_VINFO_GATHER_SCATTER_P ... I see the caller of
vect_update_init_of_dr there does nothing for STMT_VINFO_GATHER_SCATTER_P.

I wonder if (as followup to not delay this further) we can
"offload" all the DR adjustment by storing ADVANCE in dr_vec_info
and accounting for that when we create the dataref pointers in
vectorizable_load/store?  That way we could avoid saving/restoring
DR_OFFSET as well. 

So, the patch is OK with the sequence iteration fixed.  I think
sorting out the above can be done as followup.

Thanks,
Richard.

> gcc/ChangeLog:
> 2019-10-28  Andre Vieira  
> 
> PR 88915
> * tree-ssa-loop-niter.h (simplify_replace_tree): Change declaration.
> * tree-ssa-loop-niter.c (simplify_replace_tree): Add context parameter
> and make the valueize function pointer also take a void pointer.
> * gcc/tree-ssa-sccvn.c (vn_valueize_wrapper): New function to wrap
> around vn_valueize, to call it without a context.
> (process_bb): Use vn_valueize_wrapper instead of vn_valueize.
> * tree-vect-loop.c (_loop_vec_info): Initialize epilogue_vinfos.
> (~_loop_vec_info): Release epilogue_vinfos.
> (vect_analyze_loop_costing): Use knowledge of main VF to estimate
> number of iterations of epilogue.
> (vect_analyze_loop_2): Adapt to analyse main loop for all supported
> vector sizes when vect-epilogues-nomask=1.  Also keep track of lowest
> versioning threshold needed for main loop.
> (vect_analyze_loop): Likewise.
> (find_in_mapping): New helper function.
> (update_epilogue_loop_vinfo): New function.
> (vect_transform_loop): When vectorizing epilogues re-use analysis done
> on main loop and call update_epilogue_loop_vinfo to update it.
> * tree-vect-loop-manip.c (vect_update_inits_of_drs): No longer insert
> stmts on loop preheader edge.
> (vect_do_peeling): Enable skip-vectors when doing loop versioning if
> we decided to vectorize epilogues.  Update epilogues NITERS and
> construct ADVANCE to update epilogues data references where needed.
> * tree-vectorizer.h (_loop_vec_info): Add epilogue_vinfos.
> (vect_do_peeling, vect_update_inits_of_drs,
>  determine_peel_for_niter, vect_analyze_loop): Add or update declarations.
> * tree-vectorizer.c (try_vectorize_loop_1): Make sure to use already
> created loop_vec_info's for epilogues when available.  Otherwise 
> analyse
> epilogue separately.
> 
> 
> 
> Cheers,
> Andre
> 
> On 28/10/2019 14:16, Richard Biener wrote:
> > On Fri, 25 Oct 2019, Andre Vieira (lists) wrote:
> > 
> >> Hi,
> >>
> >> This is the reworked patch after your comments.
> >>
> >> I have moved the epilogue check into the analysis form disguised under
> >> '!epilogue_vinfos.is_empty ()'.  This because I realized that I am doing
> >> the
> >> "lowest threshold" check there.
> >>
> >> The only place where we may reject an epilogue_vinfo is when we know the
> >> number of scalar iterations and we realize the number of iterations left
> >> after
> >> the main loop are not enough to enter the vectorized epilogue so we
> >> optimize
> >> away that code-gen.  The only way we know this to be true is if the number
> >> of
> >> scalar iterations are known and the peeling for alignment is known. So we
> >> know
> >> we will enter the main loop regardless, so whether the threshold we use is
> >> for
> >> a lower VF or not it shouldn't matter as much, I would even 

Re: [PATCH 1/2][vect]PR 88915: Vectorize epilogues when versioning loops

2019-10-28 Thread Andre Vieira (lists)

Hi,

Reworked according to your comments, see inline for clarification.

Is this OK for trunk?

gcc/ChangeLog:
2019-10-28  Andre Vieira  

PR 88915
* tree-ssa-loop-niter.h (simplify_replace_tree): Change declaration.
* tree-ssa-loop-niter.c (simplify_replace_tree): Add context parameter
and make the valueize function pointer also take a void pointer.
* gcc/tree-ssa-sccvn.c (vn_valueize_wrapper): New function to wrap
around vn_valueize, to call it without a context.
(process_bb): Use vn_valueize_wrapper instead of vn_valueize.
* tree-vect-loop.c (_loop_vec_info): Initialize epilogue_vinfos.
(~_loop_vec_info): Release epilogue_vinfos.
(vect_analyze_loop_costing): Use knowledge of main VF to estimate
number of iterations of epilogue.
(vect_analyze_loop_2): Adapt to analyse main loop for all supported
vector sizes when vect-epilogues-nomask=1.  Also keep track of lowest
versioning threshold needed for main loop.
(vect_analyze_loop): Likewise.
(find_in_mapping): New helper function.
(update_epilogue_loop_vinfo): New function.
(vect_transform_loop): When vectorizing epilogues re-use analysis done
on main loop and call update_epilogue_loop_vinfo to update it.
* tree-vect-loop-manip.c (vect_update_inits_of_drs): No longer insert
stmts on loop preheader edge.
(vect_do_peeling): Enable skip-vectors when doing loop versioning if
we decided to vectorize epilogues.  Update epilogues NITERS and
construct ADVANCE to update epilogues data references where needed.
* tree-vectorizer.h (_loop_vec_info): Add epilogue_vinfos.
(vect_do_peeling, vect_update_inits_of_drs,
 determine_peel_for_niter, vect_analyze_loop): Add or update 
declarations.

* tree-vectorizer.c (try_vectorize_loop_1): Make sure to use already
created loop_vec_info's for epilogues when available.  Otherwise 
analyse

epilogue separately.



Cheers,
Andre

On 28/10/2019 14:16, Richard Biener wrote:

On Fri, 25 Oct 2019, Andre Vieira (lists) wrote:


Hi,

This is the reworked patch after your comments.

I have moved the epilogue check into the analysis form disguised under
'!epilogue_vinfos.is_empty ()'.  This because I realized that I am doing the
"lowest threshold" check there.

The only place where we may reject an epilogue_vinfo is when we know the
number of scalar iterations and we realize the number of iterations left after
the main loop are not enough to enter the vectorized epilogue so we optimize
away that code-gen.  The only way we know this to be true is if the number of
scalar iterations are known and the peeling for alignment is known. So we know
we will enter the main loop regardless, so whether the threshold we use is for
a lower VF or not it shouldn't matter as much, I would even like to think that
check isn't done, but I am not sure... Might be worth checking as an
optimization.


Is this OK for trunk?


+  for (epilogue_phi_gsi = gsi_start_phis (epilogue_bbs[i]);
+  !gsi_end_p (epilogue_phi_gsi); gsi_next (_phi_gsi))
+   {
..
+ if (STMT_VINFO_PATTERN_DEF_SEQ (stmt_vinfo))
+   pattern_worklist.safe_push (stmt_vinfo);
+
+ related_vinfo = STMT_VINFO_RELATED_STMT (stmt_vinfo);
+ while (related_vinfo && related_vinfo != stmt_vinfo)
+   {

I think PHIs cannot have patterns.  You can assert
that STMT_VINFO_RELATED_STMT is NULL I think.


Done.


+ related_vinfo = STMT_VINFO_RELATED_STMT (stmt_vinfo);
+ while (related_vinfo && related_vinfo != stmt_vinfo)
+   {
+ related_worklist.safe_push (related_vinfo);
+ /* Set BB such that the assert in
+   'get_initial_def_for_reduction' is able to determine that
+   the BB of the related stmt is inside this loop.  */
+ gimple_set_bb (STMT_VINFO_STMT (related_vinfo),
+gimple_bb (new_stmt));
+ related_vinfo = STMT_VINFO_RELATED_STMT (related_vinfo);
+   }

do we really keep references to "nested" patterns?  Thus, do you
need this loop?


Changed and added asserts.  They didn't trigger so I suppose you are 
right, I didn't know at the time whether it was possible, so I just 
operated on the side of caution.  Can remove the asserts and so on if 
you want.


+  /* The PATTERN_DEF_SEQs in the epilogue were constructed using the
+ original main loop and thus need to be updated to refer to the
cloned
+ variables used in the epilogue.  */
+  for (unsigned i = 0; i < pattern_worklist.length (); ++i)
+{
...
+ op = simplify_replace_tree (op, NULL_TREE, NULL_TREE,
+_in_mapping, );
+ gimple_set_op (seq, j, op);

you do this for the pattern-def seq but not for the related one.
I guess you ran into this for COND_EXPR conditions.  I wondered
to use a shared worklist for both the def-seq and the main pattern
stmt or at least to split out the 

Re: [PATCH 1/2][vect]PR 88915: Vectorize epilogues when versioning loops

2019-10-28 Thread Richard Biener
On Fri, 25 Oct 2019, Andre Vieira (lists) wrote:

> Hi,
> 
> This is the reworked patch after your comments.
> 
> I have moved the epilogue check into the analysis form disguised under
> '!epilogue_vinfos.is_empty ()'.  This because I realized that I am doing the
> "lowest threshold" check there.
> 
> The only place where we may reject an epilogue_vinfo is when we know the
> number of scalar iterations and we realize the number of iterations left after
> the main loop are not enough to enter the vectorized epilogue so we optimize
> away that code-gen.  The only way we know this to be true is if the number of
> scalar iterations are known and the peeling for alignment is known. So we know
> we will enter the main loop regardless, so whether the threshold we use is for
> a lower VF or not it shouldn't matter as much, I would even like to think that
> check isn't done, but I am not sure... Might be worth checking as an
> optimization.
> 
> 
> Is this OK for trunk?

+  for (epilogue_phi_gsi = gsi_start_phis (epilogue_bbs[i]);
+  !gsi_end_p (epilogue_phi_gsi); gsi_next (_phi_gsi))
+   {
..
+ if (STMT_VINFO_PATTERN_DEF_SEQ (stmt_vinfo))
+   pattern_worklist.safe_push (stmt_vinfo);
+
+ related_vinfo = STMT_VINFO_RELATED_STMT (stmt_vinfo);
+ while (related_vinfo && related_vinfo != stmt_vinfo)
+   {

I think PHIs cannot have patterns.  You can assert
that STMT_VINFO_RELATED_STMT is NULL I think.

+ related_vinfo = STMT_VINFO_RELATED_STMT (stmt_vinfo);
+ while (related_vinfo && related_vinfo != stmt_vinfo)
+   {
+ related_worklist.safe_push (related_vinfo);
+ /* Set BB such that the assert in
+   'get_initial_def_for_reduction' is able to determine that
+   the BB of the related stmt is inside this loop.  */
+ gimple_set_bb (STMT_VINFO_STMT (related_vinfo),
+gimple_bb (new_stmt));
+ related_vinfo = STMT_VINFO_RELATED_STMT (related_vinfo);
+   }

do we really keep references to "nested" patterns?  Thus, do you
need this loop?

+  /* The PATTERN_DEF_SEQs in the epilogue were constructed using the
+ original main loop and thus need to be updated to refer to the 
cloned
+ variables used in the epilogue.  */
+  for (unsigned i = 0; i < pattern_worklist.length (); ++i)
+{
...
+ op = simplify_replace_tree (op, NULL_TREE, NULL_TREE,
+_in_mapping, );
+ gimple_set_op (seq, j, op);

you do this for the pattern-def seq but not for the related one.
I guess you ran into this for COND_EXPR conditions.  I wondered
to use a shared worklist for both the def-seq and the main pattern
stmt or at least to split out the replacement so you can share it.

+  /* Data references for gather loads and scatter stores do not use 
the
+updated offset we set using ADVANCE.  Instead we have to make 
sure the
+reference in the data references point to the corresponding copy 
of
+the original in the epilogue.  */
+  if (STMT_VINFO_GATHER_SCATTER_P (stmt_vinfo))
+   {
+ int j;
+ if (TREE_CODE (DR_REF (dr)) == MEM_REF)
+   j = 0;
+ else if (TREE_CODE (DR_REF (dr)) == ARRAY_REF)
+   j = 1;
+ else
+   gcc_unreachable ();
+
+ if (tree *new_op = mapping.get (TREE_OPERAND (DR_REF (dr), j)))
+   {
+ DR_REF (dr) = unshare_expr (DR_REF (dr));
+ TREE_OPERAND (DR_REF (dr), j) = *new_op;
+   }

huh, do you really only ever see MEM_REF or ARRAY_REF here?
I would guess using simplify_replace_tree is safer.
There's also DR_BASE_ADDRESS - we seem to leave the DRs partially
updated, is that correct?

Otherwise looks OK to me.

Thanks,
Richard.


> gcc/ChangeLog:
> 2019-10-25  Andre Vieira  
> 
> PR 88915
> * tree-ssa-loop-niter.h (simplify_replace_tree): Change declaration.
> * tree-ssa-loop-niter.c (simplify_replace_tree): Add context parameter
> and make the valueize function pointer also take a void pointer.
> * gcc/tree-ssa-sccvn.c (vn_valueize_wrapper): New function to wrap
> around vn_valueize, to call it without a context.
> (process_bb): Use vn_valueize_wrapper instead of vn_valueize.
> * tree-vect-loop.c (_loop_vec_info): Initialize epilogue_vinfos.
> (~_loop_vec_info): Release epilogue_vinfos.
> (vect_analyze_loop_costing): Use knowledge of main VF to estimate
> number of iterations of epilogue.
> (vect_analyze_loop_2): Adapt to analyse main loop for all supported
> vector sizes when vect-epilogues-nomask=1.  Also keep track of lowest
> versioning threshold needed for main loop.
> (vect_analyze_loop): Likewise.
> (find_in_mapping): New helper function.
> (update_epilogue_loop_vinfo): New function.
> (vect_transform_loop): When vectorizing epilogues re-use analysis done
> on main 

Re: [PATCH 1/2][vect]PR 88915: Vectorize epilogues when versioning loops

2019-10-28 Thread Richard Biener
On Fri, 25 Oct 2019, Andre Vieira (lists) wrote:

> 
> 
> On 22/10/2019 14:56, Richard Biener wrote:
> > On Tue, 22 Oct 2019, Andre Vieira (lists) wrote:
> > 
> >> Hi Richi,
> >>
> >> See inline responses to your comments.
> >>
> >> On 11/10/2019 13:57, Richard Biener wrote:
> >>> On Thu, 10 Oct 2019, Andre Vieira (lists) wrote:
> >>>
>  Hi,
> 
> >>
> >>>
> >>> +
> >>> +  /* Keep track of vector sizes we know we can vectorize the epilogue
> >>> with.  */
> >>> +  vector_sizes epilogue_vsizes;
> >>>};
> >>>
> >>> please don't enlarge struct loop, instead track this somewhere
> >>> in the vectorizer (in loop_vinfo?  I see you already have
> >>> epilogue_vinfos there - so the loop_vinfo simply lacks
> >>> convenient access to the vector_size?)  I don't see any
> >>> use that could be trivially adjusted to look at a loop_vinfo
> >>> member instead.
> >>
> >> Done.
> >>>
> >>> For the vect_update_inits_of_drs this means that we'd possibly
> >>> do less CSE.  Not sure if really an issue.
> >>
> >> CSE of what exactly? You are afraid we are repeating a calculation here we
> >> have done elsewhere before?
> > 
> > All uses of those inits now possibly get the expression instead of
> > just the SSA name we inserted code for once.  But as said, we'll see.
> > 
> 
> This code changed after some comments from Richard Sandiford.
> 
> > +  /* We are done vectorizing the main loop, so now we update the
> > epilogues
> > + stmt_vec_info's.  At the same time we set the gimple UID of each
> > + statement in the epilogue, as these are used to look them up in the
> > + epilogues loop_vec_info later.  We also keep track of what
> > + stmt_vec_info's have PATTERN_DEF_SEQ's and RELATED_STMT's that might
> > + need updating and we construct a mapping between variables defined
> > in
> > + the main loop and their corresponding names in epilogue.  */
> > +  for (unsigned i = 0; i < epilogue->num_nodes; ++i)
> > 
> > so for the following code I wonder if you can make use of the
> > fact that loop copying also copies UIDs, so you should be able
> > to match stmts via their UIDs and get at the other loop infos
> > stmt_info by the copy loop stmt UID.
> > 
> > I wonder why you need no modification for the SLP tree?
> > 
> I checked with Tamar and the SLP tree works with the position of operands and
> not SSA_NAMES.  So we should be fine.

There's now SLP_TREE_SCALAR_OPS but only for invariants so I guess
we should indeed be fine here.  Everything else is already
stmt_infos which you patch with the new underlying stmts.

Richard.


Re: [PATCH 1/2][vect]PR 88915: Vectorize epilogues when versioning loops

2019-10-25 Thread Andre Vieira (lists)

Hi,

This is the reworked patch after your comments.

I have moved the epilogue check into the analysis form disguised under 
'!epilogue_vinfos.is_empty ()'.  This because I realized that I am doing 
the "lowest threshold" check there.


The only place where we may reject an epilogue_vinfo is when we know the 
number of scalar iterations and we realize the number of iterations left 
after the main loop are not enough to enter the vectorized epilogue so 
we optimize away that code-gen.  The only way we know this to be true is 
if the number of scalar iterations are known and the peeling for 
alignment is known. So we know we will enter the main loop regardless, 
so whether the threshold we use is for a lower VF or not it shouldn't 
matter as much, I would even like to think that check isn't done, but I 
am not sure... Might be worth checking as an optimization.



Is this OK for trunk?

gcc/ChangeLog:
2019-10-25  Andre Vieira  

PR 88915
* tree-ssa-loop-niter.h (simplify_replace_tree): Change declaration.
* tree-ssa-loop-niter.c (simplify_replace_tree): Add context parameter
and make the valueize function pointer also take a void pointer.
* gcc/tree-ssa-sccvn.c (vn_valueize_wrapper): New function to wrap
around vn_valueize, to call it without a context.
(process_bb): Use vn_valueize_wrapper instead of vn_valueize.
* tree-vect-loop.c (_loop_vec_info): Initialize epilogue_vinfos.
(~_loop_vec_info): Release epilogue_vinfos.
(vect_analyze_loop_costing): Use knowledge of main VF to estimate
number of iterations of epilogue.
(vect_analyze_loop_2): Adapt to analyse main loop for all supported
vector sizes when vect-epilogues-nomask=1.  Also keep track of lowest
versioning threshold needed for main loop.
(vect_analyze_loop): Likewise.
(find_in_mapping): New helper function.
(update_epilogue_loop_vinfo): New function.
(vect_transform_loop): When vectorizing epilogues re-use analysis done
on main loop and call update_epilogue_loop_vinfo to update it.
* tree-vect-loop-manip.c (vect_update_inits_of_drs): No longer insert
stmts on loop preheader edge.
(vect_do_peeling): Enable skip-vectors when doing loop versioning if
we decided to vectorize epilogues.  Update epilogues NITERS and
construct ADVANCE to update epilogues data references where needed.
* tree-vectorizer.h (_loop_vec_info): Add epilogue_vinfos.
(vect_do_peeling, vect_update_inits_of_drs,
 determine_peel_for_niter, vect_analyze_loop): Add or update 
declarations.

* tree-vectorizer.c (try_vectorize_loop_1): Make sure to use already
created loop_vec_info's for epilogues when available.  Otherwise 
analyse

epilogue separately.



Cheers,
Andre
diff --git a/gcc/tree-ssa-loop-niter.h b/gcc/tree-ssa-loop-niter.h
index 4454c1ac78e02228047511a9e0214c82946855b8..aec6225125ce42ab0e4dbc930fc1a93862e6e267 100644
--- a/gcc/tree-ssa-loop-niter.h
+++ b/gcc/tree-ssa-loop-niter.h
@@ -53,7 +53,9 @@ extern bool scev_probably_wraps_p (tree, tree, tree, gimple *,
    class loop *, bool);
 extern void free_numbers_of_iterations_estimates (class loop *);
 extern void free_numbers_of_iterations_estimates (function *);
-extern tree simplify_replace_tree (tree, tree, tree, tree (*)(tree) = NULL);
+extern tree simplify_replace_tree (tree, tree,
+   tree, tree (*)(tree, void *) = NULL,
+   void * = NULL);
 extern void substitute_in_loop_info (class loop *, tree, tree);
 
 #endif /* GCC_TREE_SSA_LOOP_NITER_H */
diff --git a/gcc/tree-ssa-loop-niter.c b/gcc/tree-ssa-loop-niter.c
index cd2ced369719c37afd4aac08ff360719d7702e42..db666f019808850ed3a4aeef1a454a7ae2c65ef2 100644
--- a/gcc/tree-ssa-loop-niter.c
+++ b/gcc/tree-ssa-loop-niter.c
@@ -1935,7 +1935,7 @@ number_of_iterations_cond (class loop *loop,
 
 tree
 simplify_replace_tree (tree expr, tree old, tree new_tree,
-		   tree (*valueize) (tree))
+		   tree (*valueize) (tree, void*), void *context)
 {
   unsigned i, n;
   tree ret = NULL_TREE, e, se;
@@ -1951,7 +1951,7 @@ simplify_replace_tree (tree expr, tree old, tree new_tree,
 {
   if (TREE_CODE (expr) == SSA_NAME)
 	{
-	  new_tree = valueize (expr);
+	  new_tree = valueize (expr, context);
 	  if (new_tree != expr)
 	return new_tree;
 	}
@@ -1967,7 +1967,7 @@ simplify_replace_tree (tree expr, tree old, tree new_tree,
   for (i = 0; i < n; i++)
 {
   e = TREE_OPERAND (expr, i);
-  se = simplify_replace_tree (e, old, new_tree, valueize);
+  se = simplify_replace_tree (e, old, new_tree, valueize, context);
   if (e == se)
 	continue;
 
diff --git a/gcc/tree-ssa-sccvn.c b/gcc/tree-ssa-sccvn.c
index 57331ab44dc78c16d97065cd28e8c4cdcbf8d96e..0abe3fb8453ecf2e25ff55c5c9846663f68f7c8c 100644
--- a/gcc/tree-ssa-sccvn.c
+++ b/gcc/tree-ssa-sccvn.c
@@ -309,6 +309,10 @@ static vn_tables_t valid_info;
 /* Valueization hook.  Valueize NAME if it is an SSA name, otherwise
just return it.  */
 tree (*vn_valueize) (tree);

Re: [PATCH 1/2][vect]PR 88915: Vectorize epilogues when versioning loops

2019-10-25 Thread Andre Vieira (lists)




On 22/10/2019 14:56, Richard Biener wrote:

On Tue, 22 Oct 2019, Andre Vieira (lists) wrote:


Hi Richi,

See inline responses to your comments.

On 11/10/2019 13:57, Richard Biener wrote:

On Thu, 10 Oct 2019, Andre Vieira (lists) wrote:


Hi,





+
+  /* Keep track of vector sizes we know we can vectorize the epilogue
with.  */
+  vector_sizes epilogue_vsizes;
   };

please don't enlarge struct loop, instead track this somewhere
in the vectorizer (in loop_vinfo?  I see you already have
epilogue_vinfos there - so the loop_vinfo simply lacks
convenient access to the vector_size?)  I don't see any
use that could be trivially adjusted to look at a loop_vinfo
member instead.


Done.


For the vect_update_inits_of_drs this means that we'd possibly
do less CSE.  Not sure if really an issue.


CSE of what exactly? You are afraid we are repeating a calculation here we
have done elsewhere before?


All uses of those inits now possibly get the expression instead of
just the SSA name we inserted code for once.  But as said, we'll see.



This code changed after some comments from Richard Sandiford.


+  /* We are done vectorizing the main loop, so now we update the
epilogues
+ stmt_vec_info's.  At the same time we set the gimple UID of each
+ statement in the epilogue, as these are used to look them up in the
+ epilogues loop_vec_info later.  We also keep track of what
+ stmt_vec_info's have PATTERN_DEF_SEQ's and RELATED_STMT's that might
+ need updating and we construct a mapping between variables defined
in
+ the main loop and their corresponding names in epilogue.  */
+  for (unsigned i = 0; i < epilogue->num_nodes; ++i)

so for the following code I wonder if you can make use of the
fact that loop copying also copies UIDs, so you should be able
to match stmts via their UIDs and get at the other loop infos
stmt_info by the copy loop stmt UID.

I wonder why you need no modification for the SLP tree?

I checked with Tamar and the SLP tree works with the position of 
operands and not SSA_NAMES.  So we should be fine.


Re: [PATCH 1/2][vect]PR 88915: Vectorize epilogues when versioning loops

2019-10-25 Thread Andre Vieira (lists)




On 22/10/2019 18:52, Richard Sandiford wrote:

Thanks for doing this.  Hope this message doesn't cover too much old
ground or duplicate too much...

"Andre Vieira (lists)"  writes:

@@ -2466,15 +2476,65 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree niters, 
tree nitersm1,
else
  niters_prolog = build_int_cst (type, 0);
  
+  loop_vec_info epilogue_vinfo = NULL;

+  if (vect_epilogues)
+{
+  /* Take the next epilogue_vinfo to vectorize for.  */
+  epilogue_vinfo = loop_vinfo->epilogue_vinfos[0];
+  loop_vinfo->epilogue_vinfos.ordered_remove (0);
+
+  /* Don't vectorize epilogues if this is not the most inner loop or if
+the epilogue may need peeling for alignment as the vectorizer doesn't
+know how to handle these situations properly yet.  */
+  if (loop->inner != NULL
+ || LOOP_VINFO_PEELING_FOR_ALIGNMENT (epilogue_vinfo))
+   vect_epilogues = false;
+
+}


Nit: excess blank line before "}".  Sorry if this was discussed before,
but what's the reason for delaying the check for "loop->inner" to
this point, rather than doing it in vect_analyze_loop?


Done.



+
+  tree niters_vector_mult_vf;
+  unsigned int lowest_vf = constant_lower_bound (vf);
+  /* Note LOOP_VINFO_NITERS_KNOWN_P and LOOP_VINFO_INT_NITERS work
+ on niters already ajusted for the iterations of the prologue.  */


Pre-existing typo: adjusted.  But...


+  if (LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
+  && known_eq (vf, lowest_vf))
+{
+  loop_vec_info orig_loop_vinfo;
+  if (LOOP_VINFO_EPILOGUE_P (loop_vinfo))
+   orig_loop_vinfo = LOOP_VINFO_ORIG_LOOP_INFO (loop_vinfo);
+  else
+   orig_loop_vinfo = loop_vinfo;
+  vector_sizes vector_sizes = LOOP_VINFO_EPILOGUE_SIZES (orig_loop_vinfo);
+  unsigned next_size = 0;
+  unsigned HOST_WIDE_INT eiters
+   = (LOOP_VINFO_INT_NITERS (loop_vinfo)
+  - LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo));
+
+  if (prolog_peeling > 0)
+   eiters -= prolog_peeling;


...is that comment still true?  We're now subtracting the peeling
amount here.


It is not, "adjusted" the comment ;)


Might be worth asserting prolog_peeling >= 0, just to emphasise
that we can't get here for variable peeling amounts, and then subtract
prolog_peeling unconditionally (assuming that's the right thing to do).

Can't assert as LOOP_VINFO_NITERS_KNOWN_P can be true even with 
prolog_peeling < 0, since we still know the constant number of scalar 
iterations, we just don't know how many vector iterations will be 
performed due to the runtime peeling. I will however, not reject 
vectorizing the epilogue, when we don't know how much we are peeling.

+  eiters
+   = eiters % lowest_vf + LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo);
+
+  unsigned int ratio;
+  while (next_size < vector_sizes.length ()
+&& !(constant_multiple_p (current_vector_size,
+  vector_sizes[next_size], )
+ && eiters >= lowest_vf / ratio))
+   next_size += 1;
+
+  if (next_size == vector_sizes.length ())
+   vect_epilogues = false;
+}
+
/* Prolog loop may be skipped.  */
bool skip_prolog = (prolog_peeling != 0);
/* Skip to epilog if scalar loop may be preferred.  It's only needed
- when we peel for epilog loop and when it hasn't been checked with
- loop versioning.  */
+ when we peel for epilog loop or when we loop version.  */
bool skip_vector = (LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
  ? maybe_lt (LOOP_VINFO_INT_NITERS (loop_vinfo),
  bound_prolog + bound_epilog)
- : !LOOP_REQUIRES_VERSIONING (loop_vinfo));
+ : (!LOOP_REQUIRES_VERSIONING (loop_vinfo)
+|| vect_epilogues));


The comment update looks wrong here: without epilogues, we don't need
the skip when loop versioning, because loop versioning ensures that we
have at least one vector iteration.

(I think "it" was supposed to mean "skipping to the epilogue" rather
than the epilogue loop itself, in case that's the confusion.)

It'd be good to mention the epilogue condition in the comment too.



Rewrote comment, hopefully this now better reflects reality.


+
+  if (vect_epilogues)
+{
+  epilog->aux = epilogue_vinfo;
+  LOOP_VINFO_LOOP (epilogue_vinfo) = epilog;
+
+  loop_constraint_clear (epilog, LOOP_C_INFINITE);
+
+  /* We now must calculate the number of iterations for our epilogue.  */
+  tree cond_niters, niters;
+
+  /* Depending on whether we peel for gaps we take niters or niters - 1,
+we will refer to this as N - G, where N and G are the NITERS and
+GAP for the original loop.  */
+  niters = LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo)
+   ? LOOP_VINFO_NITERSM1 (loop_vinfo)
+   : LOOP_VINFO_NITERS (loop_vinfo);
+
+  /* Here we build a vector factorization mask:
+vf_mask = ~(VF - 1), 

Re: [PATCH 1/2][vect]PR 88915: Vectorize epilogues when versioning loops

2019-10-22 Thread Richard Sandiford
Thanks for doing this.  Hope this message doesn't cover too much old
ground or duplicate too much...

"Andre Vieira (lists)"  writes:
> @@ -2466,15 +2476,65 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree 
> niters, tree nitersm1,
>else
>  niters_prolog = build_int_cst (type, 0);
>  
> +  loop_vec_info epilogue_vinfo = NULL;
> +  if (vect_epilogues)
> +{
> +  /* Take the next epilogue_vinfo to vectorize for.  */
> +  epilogue_vinfo = loop_vinfo->epilogue_vinfos[0];
> +  loop_vinfo->epilogue_vinfos.ordered_remove (0);
> +
> +  /* Don't vectorize epilogues if this is not the most inner loop or if
> +  the epilogue may need peeling for alignment as the vectorizer doesn't
> +  know how to handle these situations properly yet.  */
> +  if (loop->inner != NULL
> +   || LOOP_VINFO_PEELING_FOR_ALIGNMENT (epilogue_vinfo))
> + vect_epilogues = false;
> +
> +}

Nit: excess blank line before "}".  Sorry if this was discussed before,
but what's the reason for delaying the check for "loop->inner" to
this point, rather than doing it in vect_analyze_loop?

> +
> +  tree niters_vector_mult_vf;
> +  unsigned int lowest_vf = constant_lower_bound (vf);
> +  /* Note LOOP_VINFO_NITERS_KNOWN_P and LOOP_VINFO_INT_NITERS work
> + on niters already ajusted for the iterations of the prologue.  */

Pre-existing typo: adjusted.  But...

> +  if (LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
> +  && known_eq (vf, lowest_vf))
> +{
> +  loop_vec_info orig_loop_vinfo;
> +  if (LOOP_VINFO_EPILOGUE_P (loop_vinfo))
> + orig_loop_vinfo = LOOP_VINFO_ORIG_LOOP_INFO (loop_vinfo);
> +  else
> + orig_loop_vinfo = loop_vinfo;
> +  vector_sizes vector_sizes = LOOP_VINFO_EPILOGUE_SIZES 
> (orig_loop_vinfo);
> +  unsigned next_size = 0;
> +  unsigned HOST_WIDE_INT eiters
> + = (LOOP_VINFO_INT_NITERS (loop_vinfo)
> +- LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo));
> +
> +  if (prolog_peeling > 0)
> + eiters -= prolog_peeling;

...is that comment still true?  We're now subtracting the peeling
amount here.

Might be worth asserting prolog_peeling >= 0, just to emphasise
that we can't get here for variable peeling amounts, and then subtract
prolog_peeling unconditionally (assuming that's the right thing to do).

> +  eiters
> + = eiters % lowest_vf + LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo);
> +
> +  unsigned int ratio;
> +  while (next_size < vector_sizes.length ()
> +  && !(constant_multiple_p (current_vector_size,
> +vector_sizes[next_size], )
> +   && eiters >= lowest_vf / ratio))
> + next_size += 1;
> +
> +  if (next_size == vector_sizes.length ())
> + vect_epilogues = false;
> +}
> +
>/* Prolog loop may be skipped.  */
>bool skip_prolog = (prolog_peeling != 0);
>/* Skip to epilog if scalar loop may be preferred.  It's only needed
> - when we peel for epilog loop and when it hasn't been checked with
> - loop versioning.  */
> + when we peel for epilog loop or when we loop version.  */
>bool skip_vector = (LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
> ? maybe_lt (LOOP_VINFO_INT_NITERS (loop_vinfo),
> bound_prolog + bound_epilog)
> -   : !LOOP_REQUIRES_VERSIONING (loop_vinfo));
> +   : (!LOOP_REQUIRES_VERSIONING (loop_vinfo)
> +  || vect_epilogues));

The comment update looks wrong here: without epilogues, we don't need
the skip when loop versioning, because loop versioning ensures that we
have at least one vector iteration.

(I think "it" was supposed to mean "skipping to the epilogue" rather
than the epilogue loop itself, in case that's the confusion.)

It'd be good to mention the epilogue condition in the comment too.

> @@ -2504,6 +2564,13 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree 
> niters, tree nitersm1,
>  
>dump_user_location_t loop_loc = find_loop_location (loop);
>class loop *scalar_loop = LOOP_VINFO_SCALAR_LOOP (loop_vinfo);
> +  if (vect_epilogues)
> +/* Make sure to set the epilogue's epilogue scalar loop, such that we can
> +   we can use the original scalar loop as remaining epilogue if
> +   necessary.  */

Double "we can".

> +LOOP_VINFO_SCALAR_LOOP (epilogue_vinfo)
> +  = LOOP_VINFO_SCALAR_LOOP (loop_vinfo);
> +
>if (prolog_peeling)
>  {
>e = loop_preheader_edge (loop);
> @@ -2584,14 +2651,22 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree 
> niters, tree nitersm1,
>  "loop can't be duplicated to exit edge.\n");
> gcc_unreachable ();
>   }
> -  /* Peel epilog and put it on exit edge of loop.  */
> -  epilog = slpeel_tree_duplicate_loop_to_edge_cfg (loop, scalar_loop, e);
> +  /* Peel epilog and put it on exit edge of loop.  If we are vectorizing
> +  said epilog then we should use a copy of the main loop as a 

Re: [PATCH 1/2][vect]PR 88915: Vectorize epilogues when versioning loops

2019-10-22 Thread Richard Biener
On Tue, 22 Oct 2019, Andre Vieira (lists) wrote:

> Hi Richi,
> 
> See inline responses to your comments.
> 
> On 11/10/2019 13:57, Richard Biener wrote:
> > On Thu, 10 Oct 2019, Andre Vieira (lists) wrote:
> > 
> >> Hi,
> >>
> 
> > 
> > +
> > +  /* Keep track of vector sizes we know we can vectorize the epilogue
> > with.  */
> > +  vector_sizes epilogue_vsizes;
> >   };
> > 
> > please don't enlarge struct loop, instead track this somewhere
> > in the vectorizer (in loop_vinfo?  I see you already have
> > epilogue_vinfos there - so the loop_vinfo simply lacks
> > convenient access to the vector_size?)  I don't see any
> > use that could be trivially adjusted to look at a loop_vinfo
> > member instead.
> 
> Done.
> > 
> > For the vect_update_inits_of_drs this means that we'd possibly
> > do less CSE.  Not sure if really an issue.
> 
> CSE of what exactly? You are afraid we are repeating a calculation here we
> have done elsewhere before?

All uses of those inits now possibly get the expression instead of
just the SSA name we inserted code for once.  But as said, we'll see.

> > 
> > You use LOOP_VINFO_EPILOGUE_P sometimes and sometimes
> > LOOP_VINFO_ORIG_LOOP_INFO, please change predicates to
> > LOOP_VINFO_EPILOGUE_P.
> 
> I checked and the points where I use LOOP_VINFO_ORIG_LOOP_INFO is because I
> then use the resulting loop info.  If there are cases you feel strongly about
> let me know.

Not too strongly, no.

> > 
> > @@ -2466,15 +2461,62 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree
> > niters, tree nitersm1,
> > else
> >   niters_prolog = build_int_cst (type, 0);
> >   
> > +  loop_vec_info epilogue_vinfo = NULL;
> > +  if (vect_epilogues)
> > +{
> > ...
> > +   vect_epilogues = false;
> > +}
> > +
> > 
> > I don't understand what all this does - it clearly needs a comment.
> > Maybe the overall comment of the function should be amended with
> > an overview of how we handle [multiple] epilogue loop vectorization?
> 
> I added more comments both here and on top of the function.  Hopefully it is a
> bit clearer now, but it might need some tweaking.
> 
> > 
> > +
> > +  if (epilogue_any_upper_bound && prolog_peeling >= 0)
> > +   {
> > + epilog->any_upper_bound = true;
> > + epilog->nb_iterations_upper_bound = eiters + 1;
> > +   }
> > +
> > 
> > comment missing.  How can prolog_peeling be < 0?  We likely
> > didn't set the upper bound because we don't know it in the
> > case we skipped the vector loop (skip_vector)?  So make sure
> > to not introduce wrong-code issues here - maybe do this
> > optimization as followup?n
> > 
> 
> So the reason for this code wasn't so much an optimization as it was for
> correctness.  But I was mistaken, the failure I was seeing without this code
> was not because of this code, but rather being hidden by it. The problem I was
> seeing was that a prolog was being created using the original loop copy,
> rather than the scalar loop, leading to MASK_LOAD and MASK_STORE being left in
> the scalar prolog, leading to expand ICEs. I have fixed that issue by making
> sure the SCALAR_LOOP is used for prolog peeling and either the loop copy or
> SCALAR loop for epilogue peeling depending on whether we will be vectorizing
> said epilogue.

OK.

> 
> > @@ -1726,7 +1729,13 @@ vect_analyze_loop_costing (loop_vec_info
> > loop_vinfo)
> > return 0;
> >   }
> > 
> > -  HOST_WIDE_INT estimated_niter = estimated_stmt_executions_int (loop);
> > +  HOST_WIDE_INT estimated_niter = -1;
> > +
> > +  if (LOOP_VINFO_ORIG_LOOP_INFO (loop_vinfo))
> > +estimated_niter
> > +  = vect_vf_for_cost (LOOP_VINFO_ORIG_LOOP_INFO (loop_vinfo)) - 1;
> > +  if (estimated_niter == -1)
> > +estimated_niter = estimated_stmt_executions_int (loop);
> > if (estimated_niter == -1)
> >   estimated_niter = likely_max_stmt_executions_int (loop);
> > if (estimated_niter != -1
> > 
> > it's clearer if the old code is completely in a else {} path
> > even though vect_vf_for_cost - 1 should never be -1.
> > 
> Done for the == -1 cases, need to keep the != -1 outside of course.
> > -  if (LOOP_REQUIRES_VERSIONING (loop_vinfo))
> > +  if (LOOP_REQUIRES_VERSIONING (loop_vinfo)
> > +  || ((orig_loop_vinfo = LOOP_VINFO_ORIG_LOOP_INFO (loop_vinfo))
> > + && LOOP_REQUIRES_VERSIONING (orig_loop_vinfo)))
> > 
> > not sure why we need to do this for epilouges?
> > 
> 
> This is because we want to compute the versioning threshold for epilogues such
> that we can use the minimum versioning threshold when versioning the main
> loop.  The reason we need to ask we need to ask the original main loop is
> partially because of code in 'vect_analyze_data_ref_dependences' that chooses
> to not do DR dependence analysis and thus never fills
> LOOP_VINFO_MAY_ALIAS_DDRS for the epilogues loop_vinfo and as a consequence
> LOOP_VINFO_COMP_ALIAS_DDRS is always 0.
> 
> The piece of code is preceded by this comment:
>   /* For epilogues we 

Re: [PATCH 1/2][vect]PR 88915: Vectorize epilogues when versioning loops

2019-10-22 Thread Andre Vieira (lists)

Hi Richi,

See inline responses to your comments.

On 11/10/2019 13:57, Richard Biener wrote:

On Thu, 10 Oct 2019, Andre Vieira (lists) wrote:


Hi,





+
+  /* Keep track of vector sizes we know we can vectorize the epilogue
with.  */
+  vector_sizes epilogue_vsizes;
  };

please don't enlarge struct loop, instead track this somewhere
in the vectorizer (in loop_vinfo?  I see you already have
epilogue_vinfos there - so the loop_vinfo simply lacks
convenient access to the vector_size?)  I don't see any
use that could be trivially adjusted to look at a loop_vinfo
member instead.


Done.


For the vect_update_inits_of_drs this means that we'd possibly
do less CSE.  Not sure if really an issue.


CSE of what exactly? You are afraid we are repeating a calculation here 
we have done elsewhere before?




You use LOOP_VINFO_EPILOGUE_P sometimes and sometimes
LOOP_VINFO_ORIG_LOOP_INFO, please change predicates to
LOOP_VINFO_EPILOGUE_P.


I checked and the points where I use LOOP_VINFO_ORIG_LOOP_INFO is 
because I then use the resulting loop info.  If there are cases you feel 
strongly about let me know.


@@ -2466,15 +2461,62 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree
niters, tree nitersm1,
else
  niters_prolog = build_int_cst (type, 0);
  
+  loop_vec_info epilogue_vinfo = NULL;

+  if (vect_epilogues)
+{
...
+   vect_epilogues = false;
+}
+

I don't understand what all this does - it clearly needs a comment.
Maybe the overall comment of the function should be amended with
an overview of how we handle [multiple] epilogue loop vectorization?


I added more comments both here and on top of the function.  Hopefully 
it is a bit clearer now, but it might need some tweaking.




+
+  if (epilogue_any_upper_bound && prolog_peeling >= 0)
+   {
+ epilog->any_upper_bound = true;
+ epilog->nb_iterations_upper_bound = eiters + 1;
+   }
+

comment missing.  How can prolog_peeling be < 0?  We likely
didn't set the upper bound because we don't know it in the
case we skipped the vector loop (skip_vector)?  So make sure
to not introduce wrong-code issues here - maybe do this
optimization as followup?n



So the reason for this code wasn't so much an optimization as it was for 
correctness.  But I was mistaken, the failure I was seeing without this 
code was not because of this code, but rather being hidden by it. The 
problem I was seeing was that a prolog was being created using the 
original loop copy, rather than the scalar loop, leading to MASK_LOAD 
and MASK_STORE being left in the scalar prolog, leading to expand ICEs. 
I have fixed that issue by making sure the SCALAR_LOOP is used for 
prolog peeling and either the loop copy or SCALAR loop for epilogue 
peeling depending on whether we will be vectorizing said epilogue.




@@ -1726,7 +1729,13 @@ vect_analyze_loop_costing (loop_vec_info
loop_vinfo)
return 0;
  }

-  HOST_WIDE_INT estimated_niter = estimated_stmt_executions_int (loop);
+  HOST_WIDE_INT estimated_niter = -1;
+
+  if (LOOP_VINFO_ORIG_LOOP_INFO (loop_vinfo))
+estimated_niter
+  = vect_vf_for_cost (LOOP_VINFO_ORIG_LOOP_INFO (loop_vinfo)) - 1;
+  if (estimated_niter == -1)
+estimated_niter = estimated_stmt_executions_int (loop);
if (estimated_niter == -1)
  estimated_niter = likely_max_stmt_executions_int (loop);
if (estimated_niter != -1

it's clearer if the old code is completely in a else {} path
even though vect_vf_for_cost - 1 should never be -1.


Done for the == -1 cases, need to keep the != -1 outside of course.

-  if (LOOP_REQUIRES_VERSIONING (loop_vinfo))
+  if (LOOP_REQUIRES_VERSIONING (loop_vinfo)
+  || ((orig_loop_vinfo = LOOP_VINFO_ORIG_LOOP_INFO (loop_vinfo))
+ && LOOP_REQUIRES_VERSIONING (orig_loop_vinfo)))

not sure why we need to do this for epilouges?



This is because we want to compute the versioning threshold for 
epilogues such that we can use the minimum versioning threshold when 
versioning the main loop.  The reason we need to ask we need to ask the 
original main loop is partially because of code in 
'vect_analyze_data_ref_dependences' that chooses to not do DR dependence 
analysis and thus never fills LOOP_VINFO_MAY_ALIAS_DDRS for the 
epilogues loop_vinfo and as a consequence LOOP_VINFO_COMP_ALIAS_DDRS is 
always 0.


The piece of code is preceded by this comment:
  /* For epilogues we either have no aliases or alias versioning
 was applied to original loop.  Therefore we may just get max_vf
 using VF of original loop.  */

I have added some comments to make it clearer.


+static tree
+replace_ops (tree op, hash_map )
+{

I'm quite sure I've seen such beast elsewhere ;)  simplify_replace_tree
comes up first (not a 1:1 match but hints at a possible tree
sharing issue in your variant).



The reason I couldn't use simplify_replace_tree is because I didn't know 
what the "OLD" value is at the time I want to call it.  Basically I want 
to check whether an SSA name is 

Re: [PATCH 1/2][vect]PR 88915: Vectorize epilogues when versioning loops

2019-10-11 Thread Richard Biener
On Thu, 10 Oct 2019, Andre Vieira (lists) wrote:

> Hi,
> 
> After all the discussions and respins I now believe this patch is close to
> what we envisioned.
> 
> This patch achieves two things when vect-epilogues-nomask=1:
> 1) It analyzes the original loop for each supported vector size and saves this
> analysis per loop, as well as the vector sizes we know we can vectorize the
> loop for.
> 2) When loop versioning it uses the 'skip_vector' code path to vectorize the
> epilogue, and uses the lowest versioning threshold between the main and
> epilogue's.
> 
> As side effects of this patch I also changed ensure_base_align to only update
> the alignment if the new alignment is lower than the current one.  This
> function already did that if the object was a symbol, now it behaves this way
> for any object.
> 
> I bootstrapped this patch with both vect-epilogues-nomask turned on and off on
> x86_64 (AVX512) and aarch64.  Regression tests looked good.
> 
> Is this OK for trunk?

+
+  /* Keep track of vector sizes we know we can vectorize the epilogue 
with.  */
+  vector_sizes epilogue_vsizes;
 };

please don't enlarge struct loop, instead track this somewhere
in the vectorizer (in loop_vinfo?  I see you already have
epilogue_vinfos there - so the loop_vinfo simply lacks 
convenient access to the vector_size?)  I don't see any
use that could be trivially adjusted to look at a loop_vinfo
member instead.

For the vect_update_inits_of_drs this means that we'd possibly
do less CSE.  Not sure if really an issue.

You use LOOP_VINFO_EPILOGUE_P sometimes and sometimes
LOOP_VINFO_ORIG_LOOP_INFO, please change predicates to
LOOP_VINFO_EPILOGUE_P.

@@ -2466,15 +2461,62 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree 
niters, tree nitersm1,
   else
 niters_prolog = build_int_cst (type, 0);
 
+  loop_vec_info epilogue_vinfo = NULL;
+  if (vect_epilogues)
+{ 
...
+   vect_epilogues = false;
+}
+

I don't understand what all this does - it clearly needs a comment.
Maybe the overall comment of the function should be amended with
an overview of how we handle [multiple] epilogue loop vectorization?

+
+  if (epilogue_any_upper_bound && prolog_peeling >= 0)
+   {
+ epilog->any_upper_bound = true;
+ epilog->nb_iterations_upper_bound = eiters + 1;
+   }
+

comment missing.  How can prolog_peeling be < 0?  We likely
didn't set the upper bound because we don't know it in the
case we skipped the vector loop (skip_vector)?  So make sure
to not introduce wrong-code issues here - maybe do this
optimization as followup?

 class loop *
-vect_loop_versioning (loop_vec_info loop_vinfo,
- unsigned int th, bool check_profitability,
- poly_uint64 versioning_threshold)
+vect_loop_versioning (loop_vec_info loop_vinfo)
 { 
   class loop *loop = LOOP_VINFO_LOOP (loop_vinfo), *nloop;
   class loop *scalar_loop = LOOP_VINFO_SCALAR_LOOP (loop_vinfo);
@@ -2988,10 +3151,15 @@ vect_loop_versioning (loop_vec_info loop_vinfo,
   bool version_align = LOOP_REQUIRES_VERSIONING_FOR_ALIGNMENT 
(loop_vinfo);
   bool version_alias = LOOP_REQUIRES_VERSIONING_FOR_ALIAS (loop_vinfo);
   bool version_niter = LOOP_REQUIRES_VERSIONING_FOR_NITERS (loop_vinfo);
+  poly_uint64 versioning_threshold
+= LOOP_VINFO_VERSIONING_THRESHOLD (loop_vinfo);
   tree version_simd_if_cond
 = LOOP_REQUIRES_VERSIONING_FOR_SIMD_IF_COND (loop_vinfo);
+  unsigned th = LOOP_VINFO_COST_MODEL_THRESHOLD (loop_vinfo);

-  if (check_profitability)
+  if (th >= vect_vf_for_cost (loop_vinfo)
+  && !LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
+  && !ordered_p (th, versioning_threshold))
 cond_expr = fold_build2 (GE_EXPR, boolean_type_node, 
scalar_loop_iters,
 build_int_cst (TREE_TYPE (scalar_loop_iters),
th - 1));

split out this refactoring - preapproved.

@@ -1726,7 +1729,13 @@ vect_analyze_loop_costing (loop_vec_info 
loop_vinfo)
   return 0;
 }

-  HOST_WIDE_INT estimated_niter = estimated_stmt_executions_int (loop);
+  HOST_WIDE_INT estimated_niter = -1;
+
+  if (LOOP_VINFO_ORIG_LOOP_INFO (loop_vinfo))
+estimated_niter
+  = vect_vf_for_cost (LOOP_VINFO_ORIG_LOOP_INFO (loop_vinfo)) - 1;
+  if (estimated_niter == -1)
+estimated_niter = estimated_stmt_executions_int (loop);
   if (estimated_niter == -1)
 estimated_niter = likely_max_stmt_executions_int (loop);
   if (estimated_niter != -1

it's clearer if the old code is completely in a else {} path
even though vect_vf_for_cost - 1 should never be -1.

+/* Decides whether we need to create an epilogue loop to handle
+   remaining scalar iterations and sets PEELING_FOR_NITERS accordingly.  
*/
+  
+void  
+determine_peel_for_niter (loop_vec_info loop_vinfo)
+{   
+  

extra vertical space

+  unsigned HOST_WIDE_INT const_vf;
+  HOST_WIDE_INT max_niter 

if it's a 1:1 copy outlined then split it out - preapproved
(so further 

Re: [PATCH 1/2][vect]PR 88915: Vectorize epilogues when versioning loops

2019-10-10 Thread Andre Vieira (lists)

Hi,

After all the discussions and respins I now believe this patch is close 
to what we envisioned.


This patch achieves two things when vect-epilogues-nomask=1:
1) It analyzes the original loop for each supported vector size and 
saves this analysis per loop, as well as the vector sizes we know we can 
vectorize the loop for.
2) When loop versioning it uses the 'skip_vector' code path to vectorize 
the epilogue, and uses the lowest versioning threshold between the main 
and epilogue's.


As side effects of this patch I also changed ensure_base_align to only 
update the alignment if the new alignment is lower than the current one. 
 This function already did that if the object was a symbol, now it 
behaves this way for any object.


I bootstrapped this patch with both vect-epilogues-nomask turned on and 
off on x86_64 (AVX512) and aarch64.  Regression tests looked good.


Is this OK for trunk?

gcc/ChangeLog:
2019-10-10  Andre Vieira  

PR 88915
* cfgloop.h (loop): Add epilogue_vsizes member.
* cfgloop.c (flow_loop_free): Release epilogue_vsizes.
(alloc_loop): Initialize epilogue_vsizes.
* gentype.c (main): Add poly_uint64 type and vector_sizes to
generator.
* tree-vect-loop.c (vect_get_loop_niters): Make externally visible.
(_loop_vec_info): Initialize epilogue_vinfos.
(~_loop_vec_info): Release epilogue_vinfos.
(vect_analyze_loop_costing): Use knowledge of main VF to estimate
number of iterations of epilogue.
(determine_peel_for_niter): New. Outlined code to re-use in two
places.
(vect_analyze_loop_2): Adapt to analyse main loop for all supported
vector sizes when vect-epilogues-nomask=1.  Also keep track of lowest
versioning threshold needed for main loop.
(vect_analyze_loop): Likewise.
(replace_ops): New helper function.
(vect_transform_loop): When vectorizing epilogues re-use analysis done
on main loop and update necessary information.
* tree-vect-loop-manip.c (vect_update_inits_of_drs): No longer insert
stmts on loop preheader edge.
(vect_do_peeling): Enable skip-vectors when doing loop versioning if
we decided to vectorize epilogues.  Update epilogues NITERS and
construct ADVANCE to update epilogues data references where needed.
(vect_loop_versioning): Moved decision to check_profitability
based on cost model.
* tree-vect-stmts.c (ensure_base_align): Only update alignment
if new alignment is lower.
* tree-vectorizer.h (_loop_vec_info): Add epilogue_vinfos member.
(vect_loop_versioning, vect_do_peeling, vect_get_loop_niters,
vect_update_inits_of_drs, determine_peel_for_niter,
vect_analyze_loop): Add or update declarations.
* tree-vectorizer.c (try_vectorize_loop_1): Make sure to use already
create loop_vec_info's for epilogues when available.  Otherwise analyse
epilogue separately.



Cheers,
Andre
diff --git a/gcc/cfgloop.h b/gcc/cfgloop.h
index 0b0154ffd7bf031a005de993b101d9db6dd98c43..d01512ea46467f1cf77793bdc75b48e71b0b9641 100644
--- a/gcc/cfgloop.h
+++ b/gcc/cfgloop.h
@@ -21,6 +21,7 @@ along with GCC; see the file COPYING3.  If not see
 #define GCC_CFGLOOP_H
 
 #include "cfgloopmanip.h"
+#include "target.h"
 
 /* Structure to hold decision about unrolling/peeling.  */
 enum lpt_dec
@@ -268,6 +269,9 @@ public:
  the basic-block from being collected but its index can still be
  reused.  */
   basic_block former_header;
+
+  /* Keep track of vector sizes we know we can vectorize the epilogue with.  */
+  vector_sizes epilogue_vsizes;
 };
 
 /* Set if the loop is known to be infinite.  */
diff --git a/gcc/cfgloop.c b/gcc/cfgloop.c
index 4ad1f658708f83dbd8789666c26d4bd056837bc6..f3e81bcd00b3f125389aa15b12dc5201b3578d20 100644
--- a/gcc/cfgloop.c
+++ b/gcc/cfgloop.c
@@ -198,6 +198,7 @@ flow_loop_free (class loop *loop)
   exit->prev = exit;
 }
 
+  loop->epilogue_vsizes.release();
   ggc_free (loop->exits);
   ggc_free (loop);
 }
@@ -355,6 +356,7 @@ alloc_loop (void)
   loop->nb_iterations_upper_bound = 0;
   loop->nb_iterations_likely_upper_bound = 0;
   loop->nb_iterations_estimate = 0;
+  loop->epilogue_vsizes.create(8);
   return loop;
 }
 
diff --git a/gcc/gengtype.c b/gcc/gengtype.c
index 53317337cf8c8e8caefd6b819d28b3bba301e755..80fb6ef71465b24e034fa45d69fec56be6b2e7f8 100644
--- a/gcc/gengtype.c
+++ b/gcc/gengtype.c
@@ -5197,6 +5197,7 @@ main (int argc, char **argv)
   POS_HERE (do_scalar_typedef ("widest_int", ));
   POS_HERE (do_scalar_typedef ("int64_t", ));
   POS_HERE (do_scalar_typedef ("poly_int64", ));
+  POS_HERE (do_scalar_typedef ("poly_uint64", ));
   POS_HERE (do_scalar_typedef ("uint64_t", ));
   POS_HERE (do_scalar_typedef ("uint8", ));
   POS_HERE (do_scalar_typedef ("uintptr_t", ));
@@ -5206,6 +5207,7 @@ main (int argc, char **argv)
   POS_HERE (do_scalar_typedef ("machine_mode", ));
   POS_HERE (do_scalar_typedef ("fixed_size_mode", ));
   POS_HERE (do_scalar_typedef 

Re: [PATCH 1/2][vect]PR 88915: Vectorize epilogues when versioning loops

2019-10-09 Thread Richard Biener
On Tue, 8 Oct 2019, Andre Vieira (lists) wrote:

> Hi Richard,
> 
> As I mentioned in the IRC channel, I managed to get "most" of the regression
> testsuite working for x86_64 (avx512) and aarch64.
> 
> On x86_64 I get a failure that I can't explain, was hoping you might be able
> to have a look with me:
> "PASS->FAIL: gcc.target/i386/vect-perm-odd-1.c execution test"
> 
> vect-perm-odd-1.exe segfaults and when I gdb it seems to be the first
> iteration of the main loop.  The tree dumps look alright, but I do notice the
> stack usage seems to change between --param vect-epilogue-nomask={0,1}.

So the issue is that we have

=> 0x00400778 <+72>:vmovdqa64 %zmm1,-0x40(%rax)

but the memory accessed is not appropriately aligned.  The vectorizer
sets DECL_USER_ALIGN on the stack local but somehow later it downs
it to 256:

Old value = 640
New value = 576
ensure_base_align (dr_info=0x526f788) at 
/tmp/trunk/gcc/tree-vect-stmts.c:6294
6294  DECL_USER_ALIGN (base_decl) = 1;
(gdb) l
6289  if (decl_in_symtab_p (base_decl))
6290symtab_node::get (base_decl)->increase_alignment 
(align_base_to);
6291  else
6292{
6293  SET_DECL_ALIGN (base_decl, align_base_to);
6294  DECL_USER_ALIGN (base_decl) = 1;
6295}

this means vectorizing the epilogue modifies the DRs, in particular
the base alignment?

> Am I missing to update some field that may later lead to the amount of stack
> being used? I am confused, it could very well be that I am missing something
> obvious, I am not too familiar with x86's ISA. I will try to investigate
> further.
> 
> This patch needs further clean-up and more comments (or comment updates), but
> I thought I'd share current state to see if you can help me unblock.
> 
> Cheers,
> Andre
> 

-- 
Richard Biener 
SUSE Software Solutions Germany GmbH, Maxfeldstrasse 5, 90409 Nuernberg,
Germany; GF: Felix Imendörffer; HRB 247165 (AG München)

Re: [PATCH 1/2][vect]PR 88915: Vectorize epilogues when versioning loops

2019-10-08 Thread Andre Vieira (lists)

Hi Richard,

As I mentioned in the IRC channel, I managed to get "most" of the 
regression testsuite working for x86_64 (avx512) and aarch64.


On x86_64 I get a failure that I can't explain, was hoping you might be 
able to have a look with me:

"PASS->FAIL: gcc.target/i386/vect-perm-odd-1.c execution test"

vect-perm-odd-1.exe segfaults and when I gdb it seems to be the first 
iteration of the main loop.  The tree dumps look alright, but I do 
notice the stack usage seems to change between --param 
vect-epilogue-nomask={0,1}.


Am I missing to update some field that may later lead to the amount of 
stack being used? I am confused, it could very well be that I am missing 
something obvious, I am not too familiar with x86's ISA. I will try to 
investigate further.


This patch needs further clean-up and more comments (or comment 
updates), but I thought I'd share current state to see if you can help 
me unblock.


Cheers,
Andre
diff --git a/gcc/cfgloop.h b/gcc/cfgloop.h
index 0b0154ffd7bf031a005de993b101d9db6dd98c43..d01512ea46467f1cf77793bdc75b48e71b0b9641 100644
--- a/gcc/cfgloop.h
+++ b/gcc/cfgloop.h
@@ -21,6 +21,7 @@ along with GCC; see the file COPYING3.  If not see
 #define GCC_CFGLOOP_H
 
 #include "cfgloopmanip.h"
+#include "target.h"
 
 /* Structure to hold decision about unrolling/peeling.  */
 enum lpt_dec
@@ -268,6 +269,9 @@ public:
  the basic-block from being collected but its index can still be
  reused.  */
   basic_block former_header;
+
+  /* Keep track of vector sizes we know we can vectorize the epilogue with.  */
+  vector_sizes epilogue_vsizes;
 };
 
 /* Set if the loop is known to be infinite.  */
diff --git a/gcc/cfgloop.c b/gcc/cfgloop.c
index 4ad1f658708f83dbd8789666c26d4bd056837bc6..f3e81bcd00b3f125389aa15b12dc5201b3578d20 100644
--- a/gcc/cfgloop.c
+++ b/gcc/cfgloop.c
@@ -198,6 +198,7 @@ flow_loop_free (class loop *loop)
   exit->prev = exit;
 }
 
+  loop->epilogue_vsizes.release();
   ggc_free (loop->exits);
   ggc_free (loop);
 }
@@ -355,6 +356,7 @@ alloc_loop (void)
   loop->nb_iterations_upper_bound = 0;
   loop->nb_iterations_likely_upper_bound = 0;
   loop->nb_iterations_estimate = 0;
+  loop->epilogue_vsizes.create(8);
   return loop;
 }
 
diff --git a/gcc/gengtype.c b/gcc/gengtype.c
index 53317337cf8c8e8caefd6b819d28b3bba301e755..80fb6ef71465b24e034fa45d69fec56be6b2e7f8 100644
--- a/gcc/gengtype.c
+++ b/gcc/gengtype.c
@@ -5197,6 +5197,7 @@ main (int argc, char **argv)
   POS_HERE (do_scalar_typedef ("widest_int", ));
   POS_HERE (do_scalar_typedef ("int64_t", ));
   POS_HERE (do_scalar_typedef ("poly_int64", ));
+  POS_HERE (do_scalar_typedef ("poly_uint64", ));
   POS_HERE (do_scalar_typedef ("uint64_t", ));
   POS_HERE (do_scalar_typedef ("uint8", ));
   POS_HERE (do_scalar_typedef ("uintptr_t", ));
@@ -5206,6 +5207,7 @@ main (int argc, char **argv)
   POS_HERE (do_scalar_typedef ("machine_mode", ));
   POS_HERE (do_scalar_typedef ("fixed_size_mode", ));
   POS_HERE (do_scalar_typedef ("CONSTEXPR", ));
+  POS_HERE (do_scalar_typedef ("vector_sizes", ));
   POS_HERE (do_typedef ("PTR", 
 			create_pointer (resolve_typedef ("void", )),
 			));
diff --git a/gcc/tree-vect-loop-manip.c b/gcc/tree-vect-loop-manip.c
index 5c25441c70a271f04730486e513437fffa75b7e3..189f7458b1b20be06a9a20d3ee05e74bc176434c 100644
--- a/gcc/tree-vect-loop-manip.c
+++ b/gcc/tree-vect-loop-manip.c
@@ -26,6 +26,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "tree.h"
 #include "gimple.h"
 #include "cfghooks.h"
+#include "tree-if-conv.h"
 #include "tree-pass.h"
 #include "ssa.h"
 #include "fold-const.h"
@@ -1724,7 +1725,7 @@ vect_update_init_of_dr (struct data_reference *dr, tree niters, tree_code code)
Apply vect_update_inits_of_dr to all accesses in LOOP_VINFO.
CODE and NITERS are as for vect_update_inits_of_dr.  */
 
-static void
+void
 vect_update_inits_of_drs (loop_vec_info loop_vinfo, tree niters,
 			  tree_code code)
 {
@@ -1736,19 +1737,7 @@ vect_update_inits_of_drs (loop_vec_info loop_vinfo, tree niters,
 
   /* Adjust niters to sizetype and insert stmts on loop preheader edge.  */
   if (!types_compatible_p (sizetype, TREE_TYPE (niters)))
-{
-  gimple_seq seq;
-  edge pe = loop_preheader_edge (LOOP_VINFO_LOOP (loop_vinfo));
-  tree var = create_tmp_var (sizetype, "prolog_loop_adjusted_niters");
-
-  niters = fold_convert (sizetype, niters);
-  niters = force_gimple_operand (niters, , false, var);
-  if (seq)
-	{
-	  basic_block new_bb = gsi_insert_seq_on_edge_immediate (pe, seq);
-	  gcc_assert (!new_bb);
-	}
-}
+niters = fold_convert (sizetype, niters);
 
   FOR_EACH_VEC_ELT (datarefs, i, dr)
 {
@@ -2401,14 +2390,18 @@ class loop *
 vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1,
 		 tree *niters_vector, tree *step_vector,
 		 tree *niters_vector_mult_vf_var, int th,
-		 bool check_profitability, bool niters_no_overflow)
+		 

Re: [PATCH 1/2][vect]PR 88915: Vectorize epilogues when versioning loops

2019-09-18 Thread Andre Vieira (lists)

Hi Richard,

As I mentioned in the IRC channel, this is my current work in progress 
patch. It currently ICE's when vectorizing 
gcc/testsuite/gcc.c-torture/execute/nestfunc-2.c with '-O3' and '--param 
vect-epilogues-nomask=1'.


It ICE's because the epilogue loop (after if conversion) and main loop 
(before vectorization) are not the same, there are a bunch of extra BBs 
and the epilogue loop seems to need some cleaning up too.


Let me know if you see a way around this issue.

Cheers,
Andre
diff --git a/gcc/cfgloop.h b/gcc/cfgloop.h
index 0b0154ffd7bf031a005de993b101d9db6dd98c43..d01512ea46467f1cf77793bdc75b48e71b0b9641 100644
--- a/gcc/cfgloop.h
+++ b/gcc/cfgloop.h
@@ -21,6 +21,7 @@ along with GCC; see the file COPYING3.  If not see
 #define GCC_CFGLOOP_H
 
 #include "cfgloopmanip.h"
+#include "target.h"
 
 /* Structure to hold decision about unrolling/peeling.  */
 enum lpt_dec
@@ -268,6 +269,9 @@ public:
  the basic-block from being collected but its index can still be
  reused.  */
   basic_block former_header;
+
+  /* Keep track of vector sizes we know we can vectorize the epilogue with.  */
+  vector_sizes epilogue_vsizes;
 };
 
 /* Set if the loop is known to be infinite.  */
diff --git a/gcc/cfgloop.c b/gcc/cfgloop.c
index 4ad1f658708f83dbd8789666c26d4bd056837bc6..f3e81bcd00b3f125389aa15b12dc5201b3578d20 100644
--- a/gcc/cfgloop.c
+++ b/gcc/cfgloop.c
@@ -198,6 +198,7 @@ flow_loop_free (class loop *loop)
   exit->prev = exit;
 }
 
+  loop->epilogue_vsizes.release();
   ggc_free (loop->exits);
   ggc_free (loop);
 }
@@ -355,6 +356,7 @@ alloc_loop (void)
   loop->nb_iterations_upper_bound = 0;
   loop->nb_iterations_likely_upper_bound = 0;
   loop->nb_iterations_estimate = 0;
+  loop->epilogue_vsizes.create(8);
   return loop;
 }
 
diff --git a/gcc/gengtype.c b/gcc/gengtype.c
index 53317337cf8c8e8caefd6b819d28b3bba301e755..80fb6ef71465b24e034fa45d69fec56be6b2e7f8 100644
--- a/gcc/gengtype.c
+++ b/gcc/gengtype.c
@@ -5197,6 +5197,7 @@ main (int argc, char **argv)
   POS_HERE (do_scalar_typedef ("widest_int", ));
   POS_HERE (do_scalar_typedef ("int64_t", ));
   POS_HERE (do_scalar_typedef ("poly_int64", ));
+  POS_HERE (do_scalar_typedef ("poly_uint64", ));
   POS_HERE (do_scalar_typedef ("uint64_t", ));
   POS_HERE (do_scalar_typedef ("uint8", ));
   POS_HERE (do_scalar_typedef ("uintptr_t", ));
@@ -5206,6 +5207,7 @@ main (int argc, char **argv)
   POS_HERE (do_scalar_typedef ("machine_mode", ));
   POS_HERE (do_scalar_typedef ("fixed_size_mode", ));
   POS_HERE (do_scalar_typedef ("CONSTEXPR", ));
+  POS_HERE (do_scalar_typedef ("vector_sizes", ));
   POS_HERE (do_typedef ("PTR", 
 			create_pointer (resolve_typedef ("void", )),
 			));
diff --git a/gcc/tree-vect-loop-manip.c b/gcc/tree-vect-loop-manip.c
index 5c25441c70a271f04730486e513437fffa75b7e3..b1c13dafdeeec8d95f00c232822d3ab9b11f4046 100644
--- a/gcc/tree-vect-loop-manip.c
+++ b/gcc/tree-vect-loop-manip.c
@@ -26,6 +26,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "tree.h"
 #include "gimple.h"
 #include "cfghooks.h"
+#include "tree-if-conv.h"
 #include "tree-pass.h"
 #include "ssa.h"
 #include "fold-const.h"
@@ -1730,6 +1731,7 @@ vect_update_inits_of_drs (loop_vec_info loop_vinfo, tree niters,
 {
   unsigned int i;
   vec datarefs = LOOP_VINFO_DATAREFS (loop_vinfo);
+  vec datarefs_copy = loop_vinfo->shared->datarefs_copy;
   struct data_reference *dr;
 
   DUMP_VECT_SCOPE ("vect_update_inits_of_dr");
@@ -1756,6 +1758,12 @@ vect_update_inits_of_drs (loop_vec_info loop_vinfo, tree niters,
   if (!STMT_VINFO_GATHER_SCATTER_P (dr_info->stmt))
 	vect_update_init_of_dr (dr, niters, code);
 }
+  FOR_EACH_VEC_ELT (datarefs_copy, i, dr)
+{
+  dr_vec_info *dr_info = loop_vinfo->lookup_dr (dr);
+  if (!STMT_VINFO_GATHER_SCATTER_P (dr_info->stmt))
+	vect_update_init_of_dr (dr, niters, code);
+}
 }
 
 /* For the information recorded in LOOP_VINFO prepare the loop for peeling
@@ -2409,6 +2417,8 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1,
   profile_probability prob_prolog, prob_vector, prob_epilog;
   int estimated_vf;
   int prolog_peeling = 0;
+  bool vect_epilogues
+= loop_vinfo->epilogue_vinfos.length () > 0;
   /* We currently do not support prolog peeling if the target alignment is not
  known at compile time.  'vect_gen_prolog_loop_niters' depends on the
  target alignment being constant.  */
@@ -2469,12 +2479,12 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1,
   /* Prolog loop may be skipped.  */
   bool skip_prolog = (prolog_peeling != 0);
   /* Skip to epilog if scalar loop may be preferred.  It's only needed
- when we peel for epilog loop and when it hasn't been checked with
- loop versioning.  */
+ when we peel for epilog loop or when we loop version.  */
   bool skip_vector = (LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
 		  ? maybe_lt 

Re: [PATCH 1/2][vect]PR 88915: Vectorize epilogues when versioning loops

2019-09-04 Thread Jeff Law
On 8/23/19 10:50 AM, Andre Vieira (lists) wrote:
> Hi,
> 
> This patch is an improvement on my last RFC.  As you pointed out, we can
> do the vectorization analysis of the epilogues before doing the
> transformation, using the same approach as used by openmp simd.  I have
> not yet incorporated the cost tweaks for vectorizing the epilogue, I
> would like to do this in a subsequent patch, to make it easier to test
> the differences.
> 
> I currently disable the vectorization of epilogues when versioning for
> iterations.  This is simply because I do not completely understand how
> the assumptions are created and couldn't determine whether using
> skip_vectors with this would work.  If you don't think it is a problem
> or have a testcase to show it work I would gladly look at it.
> 
> Bootstrapped this and the next patch on x86_64 and
> aarch64-unknown-linux-gnu, with no regressions (after test changes in
> next patch).
> 
> gcc/ChangeLog:
> 2019-08-23  Andre Vieira  
> 
>  PR 88915
>  * gentype.c (main): Add poly_uint64 type to generator.
>  * tree-vect-loop.c (vect_analyze_loop_2): Make it determine
>  whether we vectorize epilogue loops.
>  (vect_analyze_loop): Idem.
>  (vect_transform_loop): Pass decision to vectorize epilogues
>  to vect_do_peeling.
>  * tree-vect-loop-manip.c (vect_do_peeling): Enable skip-vectors
>  when doing loop versioning if we decided to vectorize epilogues.
>  (vect-loop_versioning): Moved decision to check_profitability
>  based on cost model.
>  * tree-vectorizer.h (vect_loop_versioning, vect_do_peeling,
>  vect_analyze_loop, vect_transform_loop): Update declarations.
>  * tree-vectorizer.c: Include params.h
>  (try_vectorize_loop_1): Initialize vect_epilogues_nomask
>  to PARAM_VECT_EPILOGUES_NOMASK and pass it to vect_analyze_loop
>  and vect_transform_loop.  Also make sure vectorizing epilogues
>  does not count towards number of vectorized loops.
Nit.  In several places you use "epilog", proper spelling is "epilogue".



> diff --git a/gcc/tree-vectorizer.c b/gcc/tree-vectorizer.c
> index 
> 173e6b51652fd023893b38da786ff28f827553b5..25c3fc8ff55e017ae0b971fa93ce8ce2a07cb94c
>  100644
> --- a/gcc/tree-vectorizer.c
> +++ b/gcc/tree-vectorizer.c
> @@ -1013,8 +1015,13 @@ try_vectorize_loop_1 (hash_table 
> *_to_vf_htab,
>  
>/* Epilogue of vectorized loop must be vectorized too.  */
>if (new_loop)
> -ret |= try_vectorize_loop_1 (simduid_to_vf_htab, num_vectorized_loops,
> -  new_loop, loop_vinfo, NULL, NULL);
> +{
> +  /* Don't include vectorized epilogues in the "vectorized loops" count.
> +   */
> +  unsigned dont_count = *num_vectorized_loops;
> +  ret |= try_vectorize_loop_1 (simduid_to_vf_htab, _count,
> +new_loop, loop_vinfo, NULL, NULL);
> +}
Nit.  Don't wrap a comment with just the closing */ on its own line.
Instead wrap before "count" so that.

This is fine for the trunk after fixing those nits.

jeff


Re: [PATCH 1/2][vect]PR 88915: Vectorize epilogues when versioning loops

2019-08-26 Thread Richard Biener
On Fri, 23 Aug 2019, Andre Vieira (lists) wrote:

> Hi,
> 
> This patch is an improvement on my last RFC.  As you pointed out, we can do
> the vectorization analysis of the epilogues before doing the transformation,
> using the same approach as used by openmp simd.  I have not yet incorporated
> the cost tweaks for vectorizing the epilogue, I would like to do this in a
> subsequent patch, to make it easier to test the differences.
> 
> I currently disable the vectorization of epilogues when versioning for
> iterations.  This is simply because I do not completely understand how the
> assumptions are created and couldn't determine whether using skip_vectors with
> this would work.  If you don't think it is a problem or have a testcase to
> show it work I would gladly look at it.

I don't think there's any problem.  Basically the versioning condition
is if (can_we_compute_niter), most of the time it is an extra
condition from niter analysis under which niter is for example zero.
This should also be the same with all vector sizes.

-   delete loop_vinfo;
+   {
+ /* Set versioning threshold of the original LOOP_VINFO 
based
+on the last vectorization of the epilog.  */
+ LOOP_VINFO_VERSIONING_THRESHOLD (first_loop_vinfo)
+   = LOOP_VINFO_VERSIONING_THRESHOLD (loop_vinfo);
+ delete loop_vinfo;
+   }

I'm not sure this works reliably since the order we process vector
sizes is under target control and not necessarily decreasing.  I think
you want to keep track of the minimum here?  Preferably separately
I guess.

>From what I see vect_analyze_loop_2 doesn't need vect_epilogues_nomask
and thus it doesn't change throughout the iteration.

   else
-   delete loop_vinfo;
+   {
+ /* Disable epilog vectorization if we can't determine the 
epilogs can
+be vectorized.  */
+ *vect_epilogues_nomask &= vectorized_loops > 1;
+ delete loop_vinfo;
+   }

and this is a bit premature and instead it should be done
just before returning success?  Maybe also storing the
epilogue vector sizes we can handle in the loop_vinfo,
thereby representing !vect_epilogues_nomask if there are no
such sizes which also means that

@@ -1013,8 +1015,13 @@ try_vectorize_loop_1 (hash_table 
*_to_vf_htab,

   /* Epilogue of vectorized loop must be vectorized too.  */
   if (new_loop)
-ret |= try_vectorize_loop_1 (simduid_to_vf_htab, 
num_vectorized_loops,
-new_loop, loop_vinfo, NULL, NULL);
+{
+  /* Don't include vectorized epilogues in the "vectorized loops" 
count.
+   */
+  unsigned dont_count = *num_vectorized_loops;
+  ret |= try_vectorize_loop_1 (simduid_to_vf_htab, _count,
+  new_loop, loop_vinfo, NULL, NULL);
+}


can be optimized to not re-check all smaller sizes (but even assert
re-analysis succeeds to the original result for the actual transform).

Otherwise this looks reasonable to me.

Thanks,
Richard.

> 
> Bootstrapped this and the next patch on x86_64 and aarch64-unknown-linux-gnu,
> with no regressions (after test changes in next patch).
> 
> gcc/ChangeLog:
> 2019-08-23  Andre Vieira  
> 
>  PR 88915
>  * gentype.c (main): Add poly_uint64 type to generator.
>  * tree-vect-loop.c (vect_analyze_loop_2): Make it determine
>  whether we vectorize epilogue loops.
>  (vect_analyze_loop): Idem.
>  (vect_transform_loop): Pass decision to vectorize epilogues
>  to vect_do_peeling.
>  * tree-vect-loop-manip.c (vect_do_peeling): Enable skip-vectors
>  when doing loop versioning if we decided to vectorize epilogues.
>  (vect-loop_versioning): Moved decision to check_profitability
>  based on cost model.
>  * tree-vectorizer.h (vect_loop_versioning, vect_do_peeling,
>  vect_analyze_loop, vect_transform_loop): Update declarations.
>  * tree-vectorizer.c: Include params.h
>  (try_vectorize_loop_1): Initialize vect_epilogues_nomask
>  to PARAM_VECT_EPILOGUES_NOMASK and pass it to vect_analyze_loop
>  and vect_transform_loop.  Also make sure vectorizing epilogues
>  does not count towards number of vectorized loops.
> 
> 

-- 
Richard Biener 
SUSE Software Solutions Germany GmbH, Maxfeldstrasse 5, 90409 Nuernberg,
Germany; GF: Felix Imendörffer; HRB 247165 (AG München)

[PATCH 1/2][vect]PR 88915: Vectorize epilogues when versioning loops

2019-08-23 Thread Andre Vieira (lists)

Hi,

This patch is an improvement on my last RFC.  As you pointed out, we can 
do the vectorization analysis of the epilogues before doing the 
transformation, using the same approach as used by openmp simd.  I have 
not yet incorporated the cost tweaks for vectorizing the epilogue, I 
would like to do this in a subsequent patch, to make it easier to test 
the differences.


I currently disable the vectorization of epilogues when versioning for 
iterations.  This is simply because I do not completely understand how 
the assumptions are created and couldn't determine whether using 
skip_vectors with this would work.  If you don't think it is a problem 
or have a testcase to show it work I would gladly look at it.


Bootstrapped this and the next patch on x86_64 and 
aarch64-unknown-linux-gnu, with no regressions (after test changes in 
next patch).


gcc/ChangeLog:
2019-08-23  Andre Vieira  

 PR 88915
 * gentype.c (main): Add poly_uint64 type to generator.
 * tree-vect-loop.c (vect_analyze_loop_2): Make it determine
 whether we vectorize epilogue loops.
 (vect_analyze_loop): Idem.
 (vect_transform_loop): Pass decision to vectorize epilogues
 to vect_do_peeling.
 * tree-vect-loop-manip.c (vect_do_peeling): Enable skip-vectors
 when doing loop versioning if we decided to vectorize epilogues.
 (vect-loop_versioning): Moved decision to check_profitability
 based on cost model.
 * tree-vectorizer.h (vect_loop_versioning, vect_do_peeling,
 vect_analyze_loop, vect_transform_loop): Update declarations.
 * tree-vectorizer.c: Include params.h
 (try_vectorize_loop_1): Initialize vect_epilogues_nomask
 to PARAM_VECT_EPILOGUES_NOMASK and pass it to vect_analyze_loop
 and vect_transform_loop.  Also make sure vectorizing epilogues
 does not count towards number of vectorized loops.

diff --git a/gcc/gengtype.c b/gcc/gengtype.c
index 53317337cf8c8e8caefd6b819d28b3bba301e755..56ffa08a7dee54837441f0c743f8c0faa285c74b 100644
--- a/gcc/gengtype.c
+++ b/gcc/gengtype.c
@@ -5197,6 +5197,7 @@ main (int argc, char **argv)
   POS_HERE (do_scalar_typedef ("widest_int", ));
   POS_HERE (do_scalar_typedef ("int64_t", ));
   POS_HERE (do_scalar_typedef ("poly_int64", ));
+  POS_HERE (do_scalar_typedef ("poly_uint64", ));
   POS_HERE (do_scalar_typedef ("uint64_t", ));
   POS_HERE (do_scalar_typedef ("uint8", ));
   POS_HERE (do_scalar_typedef ("uintptr_t", ));
diff --git a/gcc/tree-vect-loop-manip.c b/gcc/tree-vect-loop-manip.c
index 5c25441c70a271f04730486e513437fffa75b7e3..3b5f14c45b5b9b601120c6776734bbafefe1e178 100644
--- a/gcc/tree-vect-loop-manip.c
+++ b/gcc/tree-vect-loop-manip.c
@@ -2401,7 +2401,8 @@ class loop *
 vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1,
 		 tree *niters_vector, tree *step_vector,
 		 tree *niters_vector_mult_vf_var, int th,
-		 bool check_profitability, bool niters_no_overflow)
+		 bool check_profitability, bool niters_no_overflow,
+		 bool vect_epilogues_nomask)
 {
   edge e, guard_e;
   tree type = TREE_TYPE (niters), guard_cond;
@@ -2474,7 +2475,8 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1,
   bool skip_vector = (LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
 		  ? maybe_lt (LOOP_VINFO_INT_NITERS (loop_vinfo),
   bound_prolog + bound_epilog)
-		  : !LOOP_REQUIRES_VERSIONING (loop_vinfo));
+		  : (!LOOP_REQUIRES_VERSIONING (loop_vinfo)
+			 || vect_epilogues_nomask));
   /* Epilog loop must be executed if the number of iterations for epilog
  loop is known at compile time, otherwise we need to add a check at
  the end of vector loop and skip to the end of epilog loop.  */
@@ -2966,9 +2968,7 @@ vect_create_cond_for_alias_checks (loop_vec_info loop_vinfo, tree * cond_expr)
*COND_EXPR_STMT_LIST.  */
 
 class loop *
-vect_loop_versioning (loop_vec_info loop_vinfo,
-		  unsigned int th, bool check_profitability,
-		  poly_uint64 versioning_threshold)
+vect_loop_versioning (loop_vec_info loop_vinfo)
 {
   class loop *loop = LOOP_VINFO_LOOP (loop_vinfo), *nloop;
   class loop *scalar_loop = LOOP_VINFO_SCALAR_LOOP (loop_vinfo);
@@ -2988,10 +2988,15 @@ vect_loop_versioning (loop_vec_info loop_vinfo,
   bool version_align = LOOP_REQUIRES_VERSIONING_FOR_ALIGNMENT (loop_vinfo);
   bool version_alias = LOOP_REQUIRES_VERSIONING_FOR_ALIAS (loop_vinfo);
   bool version_niter = LOOP_REQUIRES_VERSIONING_FOR_NITERS (loop_vinfo);
+  poly_uint64 versioning_threshold
+= LOOP_VINFO_VERSIONING_THRESHOLD (loop_vinfo);
   tree version_simd_if_cond
 = LOOP_REQUIRES_VERSIONING_FOR_SIMD_IF_COND (loop_vinfo);
+  unsigned th = LOOP_VINFO_COST_MODEL_THRESHOLD (loop_vinfo);
 
-  if (check_profitability)
+  if (th >= vect_vf_for_cost (loop_vinfo)
+  && !LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
+  && !ordered_p (th, versioning_threshold))
 cond_expr = fold_build2