Hi,
This patch is a new approach to fixing PR66388.  IVOPT today computes an
iv_use with an iv_cand that has at least the same type precision as the
use.  On 64-bit platforms like AArch64, this results in a different iv_cand
being created for each address type iv_use, which increases register
pressure.  In fact, the BIV could be used for all iv_uses in some of these
cases.  It is a latent bug that has recently gotten worse because of the
overflow changes.
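
To make the issue concrete, here is a reduced example in the spirit of
PR66388 (illustrative only, not the testcase from the PR):

/* With a 32-bit BIV 'i' on a 64-bit target, each address use a[i] and
   b[i] needs 'i' widened to 64 bits.  Today IVOPT tends to create a
   separate 64-bit address candidate per use instead of reusing the BIV,
   increasing register pressure.  */
int
sum (int *a, int *b, int n)
{
  int s = 0;
  for (int i = 0; i < n; i++)
    s += a[i] + b[i];   /* two address uses of the same BIV */
  return s;
}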

The original approach at
https://gcc.gnu.org/ml/gcc-patches/2015-07/msg01484.html can fix the issue,
except that it conflicts with IV elimination.  It seems to me the
contradiction cannot be mitigated.

This new approach fixes the issue by adding a sizetype iv_cand for BIVs
directly.  In cases where the original BIV is preferred, the sizetype
iv_cand will be chosen.  For code generation, the sizetype iv_cand has the
same effect as the original BIV.  Actually, it's better, because the BIV
would otherwise need to be explicitly extended to sizetype to be used in an
address expression on most targets.
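
At the source level, the effect looks roughly like this (a hand-written
sketch with hypothetical names; the real transformation happens on GIMPLE):

/* Before: the 32-bit BIV 'i' is widened at every address use.  */
long
f_before (long *a, int n)
{
  long s = 0;
  for (int i = 0; i < n; i++)
    s += a[(unsigned long) i];  /* extend, scale, then add to base */
  return s;
}

/* After: the sizetype candidate 'idx' carries the widened value, so no
   extension is needed inside the loop body.  */
long
f_after (long *a, int n)
{
  long s = 0;
  unsigned long idx = 0;        /* sizetype cand added for the BIV */
  for (int i = 0; i < n; i++, idx++)
    s += a[idx];
  return s;
}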

One shortcoming of this approach is that it may introduce more iv
candidates.  To minimize the impact, this patch does some sophisticated code
analysis and adds the sizetype candidate for a BIV only if the BIV is used
as an index.  Moreover, it avoids adding a candidate of the original type if
the BIV is only used as an index; see the example below.  Statistics
collected while compiling spec2k6 show that the increase in candidate number
is modest and can be ignored.
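
For instance (again illustrative code, not from the PR):

/* The BIV 'i' is used both as an index and nonlinearly, so both the
   sizetype candidate and the original-type candidate are added (see
   add_iv_candidate_for_biv in the patch).  If 'i' were only used as an
   index, only the sizetype candidate would be added.  */
extern void g (int);

void
h (int *a, int n)
{
  for (int i = 0; i < n; i++)
    {
      a[i] = 0;   /* address use of the BIV */
      g (i);      /* nonlinear use of the BIV */
    }
}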

There are two more patches following that fix corner cases revealed by this
one.  Together they bring an obvious performance improvement for spec2k6/int
on aarch64.
Spec2k6/int
400.perlbench   3.44%
445.gobmk       -0.86%
456.hmmer       14.83%
458.sjeng       2.49%
462.libquantum  -0.79%
GEOMEAN         1.68%

There is also about a 0.36% improvement for spec2k6/fp, mostly because of
436.cactusADM.  I believe it can be further improved, but that should be
another patch.

I also collected benchmark data for x86_64.  Spec2k6/fp is not affected.
As for spec2k6/int, though the geomean improves slightly, 400.perlbench
regresses by ~3%.  I can see that BIVs are chosen for some loops instead of
address candidates.  Generally, the loop header is simplified because iv
elimination with the BIV is simpler; the number of instructions in the loop
body is unchanged.  I suspect the regression comes from different addressing
modes.  With the BIV, a complex addressing mode like [base + index << scale
+ disp] is used, rather than [base + disp].  I guess the former decodes into
more micro-ops and is thus more expensive.  This guess can be confirmed by
manually suppressing the complex addressing mode with a higher address cost.
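
In source terms, the two loop shapes look roughly like this (illustrative C,
not the actual perlbench code):

/* BIV chosen: the access keeps its index form, which needs a complex
   addressing mode like [base + index << scale + disp].  */
long
biv_form (long *a, long n)
{
  long s = 0;
  for (long i = 0; i < n; i++)
    s += a[i];
  return s;
}

/* Address candidate chosen: the access is strength-reduced to a pointer
   increment and only needs [base] or [base + disp].  */
long
addr_form (long *a, long n)
{
  long s = 0;
  for (long *p = a; p != a + n; p++)
    s += *p;
  return s;
}
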
Now the question becomes why the overall cost of the BIV is computed lower
while the actual cost is higher.  I noticed that for most affected loops,
the loop header is bloated by iv elimination using the old address
candidate.  The bloated loop header results in a much higher cost than the
BIV, so the BIV is preferred.  I also noticed that the bloated loop header
can generally be simplified (I have a follow-on patch for this); see the
sketch below.  After applying that local patch, the old address candidate is
chosen, and most of the regression is recovered.
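
Roughly what the bloat looks like (a hand-written sketch of my reading of
the dumps, not actual compiler output):

/* Eliminating the exit test in favor of the old address candidate makes
   the preheader materialize the bound pointer; when that expression fails
   to fold, the loop header grows.  */
long
sketch (long *a, long n)
{
  long s = 0;
  long *end = a + n;    /* setup code added to the loop header */
  for (long *p = a; p != end; p++)
    s += *p;
  return s;
}
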
My conclusion is that the bloated loop header issue is to blame for the
regression, and that it can be resolved.

Bootstrapped and tested on x86_64 and aarch64.  The patch fixes the failure
of gcc.target/i386/pr49781-1.c, with no new breakage.

So what do you think?

Thanks,
bin

2015-08-31  Bin Cheng  <bin.ch...@arm.com>

        * tree-affine.c (aff_combination_expand): New parameters.
        (tree_to_aff_combination_expand): Ditto.
        * tree-affine.h (aff_combination_expand): New declaration.
        (tree_to_aff_combination_expand): Ditto.
        * tree-ssa-loop-ivopts.c (struct iv, iv_cand): New fields.
        (dump_iv): Dump no_overflow information.
        (alloc_iv): Initialize new field for struct iv.
        (struct expand_data): New struct for affine combination expanding.
        (stop_expand): New callback func for affine combination expanding.
        (find_deriving_biv_for_iv, record_biv_for_address_use): New
        functions.
        (idx_find_step): Call new functions above.
        (find_depends, add_candidate): New parameter.
        (add_iv_candidate_for_biv): Add sizetype cand for BIV.
        (get_computation_aff): Simplify conversion of cand for BIV.
        (get_computation_cost_at): Step cand's base if necessary.

Index: gcc/tree-affine.c
===================================================================
--- gcc/tree-affine.c   (revision 227163)
+++ gcc/tree-affine.c   (working copy)
@@ -625,11 +656,14 @@ struct name_expansion
 };
 
 /* Expands SSA names in COMB recursively.  CACHE is used to cache the
-   results.  */
+   results.  If callback function STOP is specified, this function stops
+   expanding when it returns TRUE.  DATA points to private structure
+   used by the callback function.  */
 
 void
 aff_combination_expand (aff_tree *comb ATTRIBUTE_UNUSED,
-                       hash_map<tree, name_expansion *> **cache)
+                       hash_map<tree, name_expansion *> **cache,
+                       bool (*stop) (void *, void *), void *data)
 {
   unsigned i;
   aff_tree to_add, current, curre;
@@ -654,6 +688,11 @@ aff_combination_expand (aff_tree *comb ATTRIBUTE_U
        name = TREE_OPERAND (e, 0);
       if (TREE_CODE (name) != SSA_NAME)
        continue;
+
+      /* Don't expand further if STOP returns TRUE.  */
+      if (stop != NULL && (*stop) (data, name))
+       continue;
+
       def = SSA_NAME_DEF_STMT (name);
       if (!is_gimple_assign (def) || gimple_assign_lhs (def) != name)
        continue;
@@ -702,7 +741,8 @@ aff_combination_expand (aff_tree *comb ATTRIBUTE_U
              if (e != name)
                rhs = fold_convert (type, rhs);
            }
-         tree_to_aff_combination_expand (rhs, comb->type, &current, cache);
+         tree_to_aff_combination_expand (rhs, comb->type,
+                                         &current, cache, stop, data);
          exp->expansion = current;
          exp->in_progress = 0;
        }
@@ -735,14 +775,19 @@ aff_combination_expand (aff_tree *comb ATTRIBUTE_U
    a1 = a0 + a0;
    a2 = a1 + a1;
    a3 = a2 + a2;
-   ...  */
+   ...
+
+   If callback function STOP is specified, this function stops expanding
+   when it returns TRUE.  DATA points to private structure used by the
+   callback function.  */
 
 void
 tree_to_aff_combination_expand (tree expr, tree type, aff_tree *comb,
-                               hash_map<tree, name_expansion *> **cache)
+                               hash_map<tree, name_expansion *> **cache,
+                               bool (*stop) (void *, void *), void *data)
 {
   tree_to_aff_combination (expr, type, comb);
-  aff_combination_expand (comb, cache);
+  aff_combination_expand (comb, cache, stop, data);
 }
 
 /* Frees memory occupied by struct name_expansion in *VALUE.  Callback for
Index: gcc/tree-affine.h
===================================================================
--- gcc/tree-affine.h   (revision 227163)
+++ gcc/tree-affine.h   (working copy)
@@ -77,9 +77,12 @@ void tree_to_aff_combination (tree, tree, aff_tree
 tree aff_combination_to_tree (aff_tree *);
 void unshare_aff_combination (aff_tree *);
bool aff_combination_constant_multiple_p (aff_tree *, aff_tree *, widest_int *);
-void aff_combination_expand (aff_tree *, hash_map<tree, name_expansion *> **);
+void aff_combination_expand (aff_tree *, hash_map<tree, name_expansion *> **,
+                            bool (*)(void *, void *) = NULL, void * = NULL);
 void tree_to_aff_combination_expand (tree, tree, aff_tree *,
-                                    hash_map<tree, name_expansion *> **);
+                                    hash_map<tree, name_expansion *> **,
+                                    bool (*)(void *, void *) = NULL,
+                                    void * = NULL);
 tree get_inner_reference_aff (tree, aff_tree *, widest_int *);
 void free_affine_expand_cache (hash_map<tree, name_expansion *> **);
 bool aff_comb_cannot_overlap_p (aff_tree *, const widest_int &,
Index: gcc/tree-ssa-loop-ivopts.c
===================================================================
--- gcc/tree-ssa-loop-ivopts.c  (revision 227163)
+++ gcc/tree-ssa-loop-ivopts.c  (working copy)
@@ -147,6 +147,8 @@ struct iv
   bool biv_p;          /* Is it a biv?  */
   bool have_use_for;   /* Do we already have a use for it?  */
   bool no_overflow;    /* True if the iv doesn't overflow.  */
+  bool have_address_use;/* For biv, indicate if it's used in any address
+                          type use.  */
 };
 
 /* Per-ssa version information (induction variable descriptions, etc.).  */
@@ -251,6 +253,8 @@ struct iv_cand
                              where it is incremented.  */
   bitmap depends_on;   /* The list of invariants that are used in step of the
                           biv.  */
+  struct iv *orig_iv;  /* The original iv if this cand is added from biv with
+                          smaller type.  */
 };
 
 /* Loop invariant expression hashtable entry.  */
@@ -529,6 +533,9 @@ dump_iv (FILE *file, struct iv *iv, bool dump_name
 
   if (iv->biv_p)
     fprintf (file, "  is a biv\n");
+
+  if (iv->no_overflow)
+    fprintf (file, "  iv doesn't overflow wrt loop niter\n");
 }
 
 /* Dumps information about the USE to FILE.  */
@@ -1013,6 +1020,7 @@ alloc_iv (struct ivopts_data *data, tree base, tre
   iv->use_id = 0;
   iv->ssa_name = NULL_TREE;
   iv->no_overflow = no_overflow;
+  iv->have_address_use = false;
 
   return iv;
 }
@@ -1621,6 +1629,106 @@ expr_invariant_in_loop_p (struct loop *loop, tree
   return true;
 }
 
+/* Local data for affine combination expanding.  */
+
+struct expand_data
+{
+  struct ivopts_data *data;
+  struct iv *biv;
+};
+
+/* Callback function to tree_to_aff_combination_expand.
+
+   Return true if a biv or a loop invariant is encountered, false
+   otherwise.  */
+
+static bool
+stop_expand (void *d1, void *d2)
+{
+  struct expand_data *exp_data = (struct expand_data *)d1;
+  tree ssa_name = (tree)d2;
+  struct ivopts_data *data;
+  struct iv *iv;
+
+  if (!exp_data || !ssa_name || TREE_CODE (ssa_name) != SSA_NAME)
+    return false;
+
+  data = exp_data->data;
+  gcc_assert (data != NULL);
+  iv = get_iv (data, ssa_name);
+  if (!iv || integer_zerop (iv->step))
+    return true;
+
+  if (iv->biv_p)
+    {
+      exp_data->biv = iv;
+      return true;
+    }
+
+  return false;
+}
+
+/* Return biv from which the IV is derived.  */
+
+static struct iv *
+find_deriving_biv_for_iv (struct ivopts_data *data, struct iv *iv)
+{
+  aff_tree aff;
+  struct expand_data exp_data;
+
+  if (!iv->ssa_name || TREE_CODE (iv->ssa_name) != SSA_NAME)
+    return iv;
+
+  /* Expand IV's ssa_name till the deriving biv is found.  */
+  exp_data.data = data;
+  exp_data.biv = NULL;
+  tree_to_aff_combination_expand (iv->ssa_name, TREE_TYPE (iv->ssa_name),
+                                 &aff, &data->name_expansion_cache,
+                                 stop_expand, &exp_data);
+  return exp_data.biv;
+}
+
+/* Record that BIV, together with its predecessor and successor bivs,
+   is used in address type uses.  */
+
+static void
+record_biv_for_address_use (struct ivopts_data *data, struct iv *biv)
+{
+  unsigned i;
+  tree type, base_1, base_2;
+  bitmap_iterator bi;
+
+  if (!biv || !biv->biv_p || integer_zerop (biv->step)
+      || biv->have_address_use || !biv->no_overflow)
+    return;
+
+  type = TREE_TYPE (biv->base);
+  if (!INTEGRAL_TYPE_P (type))
+    return;
+
+  biv->have_address_use = true;
+  base_1 = fold_build2 (PLUS_EXPR, type, biv->base, biv->step);
+  EXECUTE_IF_SET_IN_BITMAP (data->relevant, 0, i, bi)
+    {
+      struct iv *iv = ver_info (data, i)->iv;
+
+      if (!iv || !iv->biv_p || integer_zerop (iv->step)
+         || iv->have_address_use || !iv->no_overflow)
+       continue;
+
+      if (type != TREE_TYPE (iv->base)
+         || !INTEGRAL_TYPE_P (TREE_TYPE (iv->base)))
+       continue;
+
+      if (!operand_equal_p (biv->step, iv->step, 0))
+       continue;
+
+      base_2 = fold_build2 (PLUS_EXPR, type, iv->base, iv->step);
+      if (operand_equal_p (base_1, iv->base, 0)
+         || operand_equal_p (base_2, biv->base, 0))
+       iv->have_address_use = true;
+    }
+}
+
 /* Cumulates the steps of indices into DATA and replaces their values with the
   initial ones.  Returns false when the value of the index cannot be
   determined.
    Callback for for_each_index.  */
@@ -1711,6 +1819,10 @@ idx_find_step (tree base, tree *idx, void *data)
   step = fold_build2 (MULT_EXPR, sizetype, step, iv_step);
   dta->step = fold_build2 (PLUS_EXPR, sizetype, dta->step, step);
 
+  if (!iv->biv_p)
+    iv = find_deriving_biv_for_iv (dta->ivopts_data, iv);
+
+  record_biv_for_address_use (dta->ivopts_data, iv);
   return true;
 }
 
@@ -2603,7 +2715,8 @@ find_depends (tree *expr_p, int *ws ATTRIBUTE_UNUS
 static struct iv_cand *
 add_candidate_1 (struct ivopts_data *data,
                 tree base, tree step, bool important, enum iv_position pos,
-                struct iv_use *use, gimple incremented_at)
+                struct iv_use *use, gimple incremented_at,
+                struct iv *orig_iv = NULL)
 {
   unsigned i;
   struct iv_cand *cand = NULL;
@@ -2685,6 +2798,7 @@ add_candidate_1 (struct ivopts_data *data,
       else
        cand->ainc_use = NULL;
 
+      cand->orig_iv = orig_iv;
       if (dump_file && (dump_flags & TDF_DETAILS))
        dump_cand (dump_file, cand);
     }
@@ -2793,15 +2907,17 @@ add_autoinc_candidates (struct ivopts_data *data,
 
 static void
 add_candidate (struct ivopts_data *data,
-              tree base, tree step, bool important, struct iv_use *use)
+              tree base, tree step, bool important, struct iv_use *use,
+              struct iv *orig_iv = NULL)
 {
   gcc_assert (use == NULL || use->sub_id == 0);
 
   if (ip_normal_pos (data->current_loop))
-    add_candidate_1 (data, base, step, important, IP_NORMAL, use, NULL);
+    add_candidate_1 (data, base, step, important,
+                    IP_NORMAL, use, NULL, orig_iv);
   if (ip_end_pos (data->current_loop)
       && allow_ip_end_pos_p (data->current_loop))
-    add_candidate_1 (data, base, step, important, IP_END, use, NULL);
+    add_candidate_1 (data, base, step, important, IP_END, use, NULL, orig_iv);
 }
 
 /* Adds standard iv candidates.  */
@@ -2836,8 +2952,24 @@ add_iv_candidate_for_biv (struct ivopts_data *data
   tree def;
   struct iv_cand *cand;
 
-  add_candidate (data, iv->base, iv->step, true, NULL);
+  /* Check if this biv is used in an address type use.  */
+  if (iv->no_overflow && iv->have_address_use
+      && INTEGRAL_TYPE_P (TREE_TYPE (iv->base))
+      && TYPE_PRECISION (TREE_TYPE (iv->base)) < TYPE_PRECISION (sizetype))
+    {
+      tree type = unsigned_type_for (sizetype);
+      tree base = fold_convert (type, iv->base);
+      tree step = fold_convert (type, iv->step);
 
+      /* Add iv cand of same precision as index part in TARGET_MEM_REF.  */
+      add_candidate (data, base, step, true, NULL, iv);
+      /* Add iv cand of the original type only if it has nonlinear use.  */
+      if (iv->have_use_for)
+       add_candidate (data, iv->base, iv->step, true, NULL);
+    }
+  else
+    add_candidate (data, iv->base, iv->step, true, NULL);
+
   /* The same, but with initial value zero.  */
   if (POINTER_TYPE_P (TREE_TYPE (iv->base)))
     add_candidate (data, size_int (0), iv->step, true, NULL);
@@ -3358,6 +3490,28 @@ get_computation_aff (struct loop *loop,
   /* If the conversion is not noop, perform it.  */
   if (TYPE_PRECISION (utype) < TYPE_PRECISION (ctype))
     {
+      if (cand->orig_iv != NULL && CONVERT_EXPR_P (cbase)
+         && (CONVERT_EXPR_P (cstep) || TREE_CODE (cstep) == INTEGER_CST))
+       {
+         tree inner_base, inner_step, inner_type;
+         inner_base = TREE_OPERAND (cbase, 0);
+         if (CONVERT_EXPR_P (cstep))
+           inner_step = TREE_OPERAND (cstep, 0);
+         else
+           inner_step = cstep;
+
+         inner_type = TREE_TYPE (inner_base);
+         /* If the candidate is added from a biv whose type is smaller
+            than ctype, we know both the candidate and the biv won't
+            overflow.  In this case, it's safe to skip the conversion in
+            the candidate.  As an example, (unsigned short)((unsigned long)A)
+            equals (unsigned short)A, if A has a type no larger than short.  */
+         if (TYPE_PRECISION (inner_type) <= TYPE_PRECISION (uutype))
+           {
+             cbase = inner_base;
+             cstep = inner_step;
+           }
+       }
       cstep = fold_convert (uutype, cstep);
       cbase = fold_convert (uutype, cbase);
       var = fold_convert (uutype, var);
@@ -4525,6 +4681,13 @@ get_computation_cost_at (struct ivopts_data *data,
                (ratio, mem_mode,
                        TYPE_ADDR_SPACE (TREE_TYPE (utype))))
     {
+      if (cstepi == 0 && stmt_is_after_inc)
+       {
+         if (POINTER_TYPE_P (ctype))
+           cbase = fold_build2 (POINTER_PLUS_EXPR, ctype, cbase, cstep);
+         else
+           cbase = fold_build2 (PLUS_EXPR, ctype, cbase, cstep);
+       }
       cbase
        = fold_build2 (MULT_EXPR, ctype, cbase, build_int_cst (ctype, ratio));
       cost = difference_cost (data,
