[gcc r16-101] Enable ip-cp cloning over non-hot edges

Jan Hubicka via Gcc-cvs Wed, 23 Apr 2025 09:41:38 -0700

https://gcc.gnu.org/g:132d01d96ea9d617aaffdd5dfba3284a8958e529


commit r16-101-g132d01d96ea9d617aaffdd5dfba3284a8958e529
Author: Jan Hubicka <hubi...@ucw.cz>
Date:   Wed Apr 23 18:39:14 2025 +0200

    Enable ip-cp cloning over non-hot edges
    
    Currently enabling profile feedback regresses x264 and exchange.  In both
    cases the root of the issue is that the ipa-cp cost model thinks cloning is
    not relevant when feedback is available, while it clones without feedback.
    
    Consider:
    
    __attribute__ ((used))
    int a[1000];
    
    __attribute__ ((noinline))
    void
    test2(int sz)
    {
      for (int i = 0; i < sz; i++)
              a[i]++;
      asm volatile (""::"m"(a));
    }
    
    __attribute__ ((noinline))
    void
    test1 (int sz)
    {
      for (int i = 0; i < 1000; i++)
              test2(sz);
    }
    int main()
    {
            test1(1000);
            return 0;
    }
    
    Here we want to clone both test1 and test2 and specialize them for 1000, but
    ipa-cp will not do that, since it skips the call main->test1 as not hot: it
    is executed just once, both with and without profile feedback.  In this
    simple testcase we track that main is called once even without profile
    feedback.
    
    I think the testcase shows that hotness of a call is not that relevant when
    deciding whether we want to propagate constants across it.  ipa-cp with IPA
    profile can compute an overall estimate of time saved (the existing time
    benefit, i.e. time saved per invocation of the function, multiplied by the
    number of executions) and see if the result is big enough.  An easy check is
    to simply call maybe_hot_p on the resulting count.
    
    So this patch makes ipa-cp consider all call sites except those known to be
    unlikely executed (i.e. run 0 times in the train run or known to lead to
    something bad) as interesting, which makes ipa-cp propagate across them, find
    cloning candidates and feed them into good_cloning_opportunity_p.
    
    For this I added cs_interesting_for_ipcp_p, which also attempts to do the
    right thing with partial training.
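
    The intent of this predicate can be sketched as follows.  This is a
    simplified model of the behaviour described above, not the code in the
    patch: EdgeProfile and edge_interesting_p are hypothetical names, and plain
    integers stand in for GCC's profile_count.

```cpp
// Simplified, hypothetical model of a call edge's profile.
// Plain integers stand in for GCC's profile_count.
struct EdgeProfile {
  bool has_ipa_profile;  // edge was covered by a profile feedback train run
  long ipa_count;        // measured executions (meaningful if has_ipa_profile)
  long local_count;      // locally estimated (guessed or adjusted) count
};

// Returns true when constants should be propagated across the edge.
// Hotness is deliberately not checked: a cold call may still lead to a
// callee containing a heavy loop.
bool edge_interesting_p(const EdgeProfile &e, bool partial_training) {
  // The train run executed the edge: always worth considering.
  if (e.has_ipa_profile && e.ipa_count > 0)
    return true;
  // Even the local estimate says the edge never runs: skip it.
  if (e.local_count == 0)
    return false;
  // The train run saw zero executions.  Assumption: only trust that zero
  // when training is not declared partial.
  if (e.has_ipa_profile)
    return partial_training;
  // No feedback for this edge at all: keep it as a candidate.
  return true;
}
```

    Under this model the call main->test1 from the testcase, executed once in
    the train run, counts as interesting even though it is not hot.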
    
    Now good_cloning_opportunity_p will currently return false, since it will
    figure out that the call edge is not very frequent.
    It already kind of knows that the frequency of the call instruction itself is
    not too important, but instead of computing overall time saved, it tries to
    compare it with the param_ipa_cp_profile_count_base percentage of counts of
    call edges.  I think this is not very relevant since estimated time saved
    per call can be large.  So I dropped this logic and replaced it with simple
    use of overall saved time.
    
    Since ipa-cp is not dealing well with the cases where it hits the allowed
    unit growth limit, we probably want to be more careful, so I keep the
    existing metric with this change.
    
    So now we get:
    
    Evaluating opportunities for test1/3.
     - considering value 1000 for param #0 sz (caller_count: 1)
         good_cloning_opportunity_p (time: 1, size: 8, count_sum: 1 (precise), overall time saved: 1 (adjusted)) -> evaluation: 0.12, threshold: 500
         not cloning: time saved is not hot
         good_cloning_opportunity_p (time: 129001, size: 20, count_sum: 1 (precise), overall time saved: 129001 (adjusted)) -> evaluation: 6450.05, threshold: 500
    
    The first call to good_cloning_opportunity_p considers the case where only
    test1 is cloned.  In this case time saved is 1 (for passing the value around)
    and since it is called just once (count_sum) overall time saved is 1, which
    is not considered hot, and we also get a very low evaluation score.
    
    In the second call we consider cloning the chain test1->test2.  In this case
    time saved is large (129001) since test2 is invoked many times and the value
    is used to control the loop.  We still know that the count is 1, but overall
    time saved is 129001, which is already considered relevant, and we clone.
    
    I also try to do something sensible in case we have calls both with
    and without IPA profile (which can happen for comdats where the profile went
    missing or with LTO if some units were not trained).
    Instead of checking whether the sum of calls with known profile is nonzero,
    I keep track of whether there are other calls and if so, also try the local
    heuristics that are used without profile feedback.
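
    The resulting decision flow, including the fallback for callers without
    IPA profile, can be sketched roughly as follows.  Candidate and
    good_opportunity are hypothetical names, doubles replace sreal and
    profile_count, penalties are omitted, and hot_threshold is an assumed
    stand-in for the maybe_hot_count_p check.

```cpp
// Hypothetical summary of one cloning candidate.
struct Candidate {
  double time_benefit;   // estimated time saved per invocation
  double freq_sum;       // summed local frequencies of the calls
  double count_sum;      // summed IPA counts of calls that have a profile
  int size_cost;         // estimated code growth in instructions
  bool called_without_ipa_profile;  // some caller has no IPA profile
};

bool good_opportunity(const Candidate &c, double hot_threshold,
                      int eval_threshold) {
  if (c.time_benefit == 0)
    return false;
  // Every caller has a profile and none of them executed: do not clone.
  if (!c.called_without_ipa_profile && c.count_sum == 0)
    return false;
  if (c.count_sum > 0) {
    // With IPA counts we can estimate the overall time saved.
    double saved = c.count_sum * c.time_benefit;
    if (saved >= hot_threshold && saved / c.size_cost >= eval_threshold)
      return true;
    // All callers are profiled, so the answer above is final.
    if (!c.called_without_ipa_profile)
      return false;
  }
  // Local heuristics, as used without profile feedback.
  return c.time_benefit * c.freq_sum / c.size_cost * 1000 >= eval_threshold;
}
```

    The point of the last branch is that a candidate rejected on measured
    counts still gets a second chance when some of its callers were never
    trained.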
    
    The patch improves SPECint with -Ofast -fprofile-use by approx 1% by
    speeding up x264 from 99.3s to 91.3s (9%) and exchange from 99.7s to 95.5s
    (3.3%).
    
    We still get better runtimes without profile (86.4s for x264 and 93.8s
    for exchange).
    
    The main problem I see is that ipa-cp has the global limit for growth of 10%
    but does not consider the opportunities in priority order.  Consequently, if
    the limit is hit, some clone opportunities are randomly dropped in favour of
    others.
    
    I dumped unit size changes with -flto -Ofast build of SPEC2017.  Without
    patch I get:
    
    orig    new     growth
    588677  605385  102.838229
    4378    6037    137.894016
    484650  494851  102.104818
    4111    4111    100.000000
    99953   103519  103.567677
    106181  114889  108.201091
    21389   21597   100.972462
    24925   26746   107.305918
    15308   23974   156.610922
    27354   27906   102.017986
    494     494     100.000000
    4631    4631    100.000000
    863216  872729  101.102042
    126604  126604  100.000000
    605138  627156  103.638509
    4112    4112    100.000000
    222006  231293  104.183220
    2952    3384    114.634146
    37584   39807   105.914751
    4111    4111    100.000000
    13226   13226   100.000000
    4111    4111    100.000000
    326215  337396  103.427494
    25240   25433   100.764659
    64644   65972   102.054328
    127223  132300  103.990631
    494     494     100.000000
    
    Small units can grow up to 16000 instructions and other units are
    large.  So there is only one 156% growth hitting limits, which is exchange,
    where recursive cloning is handled specially.
    
    With profile feedback ipa-cp basically shuts itself off:
    
    333815  333891  100.022767
    2559    2974    116.217272
    217576  217581  100.002298
    2749    2749    100.000000
    64652   64716   100.098992
    68416   69707   101.886986
    13171   13171   100.000000
    11849   11849   100.000000
    10519   16180   153.816903
    15843   15843   100.000000
    231     231     100.000000
    3624    3624    100.000000
    573385  573386  100.000174
    97623   97623   100.000000
    295673  295676  100.001015
    2750    2750    100.000000
    130723  130726  100.002295
    2334    2334    100.000000
    19313   19313   100.000000
    2749    2749    100.000000
    517331  517331  100.000000
    6707    6707    100.000000
    2749    2749    100.000000
    193638  193638  100.000000
    16425   16425   100.000000
    47154   47154   100.000000
    96422   96422   100.000000
    231     231     100.000000
    
    So we essentially clone only exchange and mcf (116%).
    With patch and no FDO I get:
    
    588677  605385  102.838229
    4378    6037    137.894016
    484519  494698  102.100846
    4111    4111    100.000000
    99953   103519  103.567677
    106181  114889  108.201091
    21389   22632   105.811398
    24854   26620   107.105496
    15308   23974   156.610922
    27354   28039   102.504204
    494     494     100.000000
    4631    4631    100.000000
    4631    4631    100.000000
    126604  126630  100.020536
    4112    4112    100.000000
    222006  231293  104.183220
    2952    3384    114.634146
    37584   39807   105.914751
    2760715 2835539 102.710312
    4111    4111    100.000000
    13226   13226   100.000000
    4111    4111    100.000000
    326215  337396  103.427494
    25240   25433   100.764659
    64644   65972   102.054328
    127223  132300  103.990631
    494     494     100.000000
    
    which seems essentially the same as without the patch.  However, with FDO I get:
    333815  350363  104.957237
    2559    3345    130.715123
    217469  220765  101.515618
    485599  488772  100.653420
    2749    2749    100.000000
    64652   74265   114.868836
    68416   87484   127.870674
    13171   20656   156.829398
    11792   11990   101.679104
    10519   17028   161.878506
    15843   16119   101.742094
    231     231     100.000000
    573336  573336  100.000000
    97623   97623   100.000000
    295497  296208  100.240612
    2750    2750    100.000000
    130723  133341  102.002708
    2334    2334    100.000000
    19313   19368   100.284782
    2749    2749    100.000000
    6707    6755    100.715670
    2749    2749    100.000000
    193638  194712  100.554643
    16425   17377   105.796043
    47154   47154   100.000000
    96422   96422   100.000000
    231     231     100.000000
    
    So here x264 grows to 114% and 127% of its original size (two different
    binaries), with 56% growth in Deepsjeng and 61% growth in Exchange, all of
    which are above the 10% cutoff.
    
    Bootstrapped/regtested x86_64-linux.
    
    gcc/ChangeLog:
    
            * ipa-cp.cc (base_count): Remove.
            (struct caller_statistics): Rename n_hot_calls to
            n_interesting_calls; add called_without_ipa_profile.
            (init_caller_stats): Update.
            (cs_interesting_for_ipcp_p): New function.
            (gather_caller_stats): Collect n_interesting_calls and
            called_without_ipa_profile.
            (ipcp_cloning_candidate_p): Use n_interesting_calls rather than hot.
            (good_cloning_opportunity_p): Rewrite heuristics when IPA profile is
            present.
            (estimate_local_effects): Update.
            (value_topo_info::propagate_effects): Update.
            (compare_edge_profile_counts): Remove.
            (ipcp_propagate_stage): Do not collect base_count.
            (get_info_about_necessary_edges): Record whether function is called
            without profile.
            (decide_about_value): Update.
            (ipa_cp_cc_finalize): Do not initialize base_count.
            * profile-count.cc (profile_count::operator*): New.
            (profile_count::operator*=): New.
            * profile-count.h (profile_count::operator*): Declare.
            (profile_count::operator*=): Declare.
            * params.opt: Remove ipa-cp-profile-count-base.
            * doc/invoke.texi: Likewise.

Diff:
---
 gcc/doc/invoke.texi                       |   5 -
 gcc/ipa-cp.cc                             | 247 +++++++++++++-----------------
 gcc/params.opt                            |   4 -
 gcc/profile-count.cc                      |  23 +++
 gcc/profile-count.h                       |   3 +
 gcc/testsuite/gcc.dg/ipa/ipa-clone-4.c    |  30 ++++
 gcc/testsuite/gcc.dg/tree-prof/ipa-cp-1.c |  30 ++++
 7 files changed, 192 insertions(+), 150 deletions(-)

diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index 88fb9bd3d7fd..a0f60e736e18 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -16897,11 +16897,6 @@ Maximum depth of recursive cloning for self-recursive function.
 Recursive cloning only when the probability of call being executed exceeds
 the parameter.
 
-@item ipa-cp-profile-count-base
-When using @option{-fprofile-use} option, IPA-CP will consider the measured
-execution count of a call graph edge at this percentage position in their
-histogram as the basis for its heuristics calculation.
-
 @item ipa-cp-recursive-freq-factor
 The number of times interprocedural copy propagation expects recursive
 functions to call themselves.
diff --git a/gcc/ipa-cp.cc b/gcc/ipa-cp.cc
index 806c2bdc97f2..abde64b6f296 100644
--- a/gcc/ipa-cp.cc
+++ b/gcc/ipa-cp.cc
@@ -147,10 +147,6 @@ object_allocator<ipcp_value_source<tree> > ipcp_sources_pool
 object_allocator<ipcp_agg_lattice> ipcp_agg_lattice_pool
   ("IPA_CP aggregate lattices");
 
-/* Base count to use in heuristics when using profile feedback.  */
-
-static profile_count base_count;
-
 /* Original overall size of the program.  */
 
 static long overall_size, orig_overall_size;
@@ -505,14 +501,16 @@ struct caller_statistics
   profile_count count_sum;
   /* Sum of all frequencies for all calls.  */
   sreal freq_sum;
-  /* Number of calls and hot calls respectively.  */
-  int n_calls, n_hot_calls;
+  /* Number of calls and calls considered interesting respectively.  */
+  int n_calls, n_interesting_calls;
   /* If itself is set up, also count the number of non-self-recursive
      calls.  */
   int n_nonrec_calls;
   /* If non-NULL, this is the node itself and calls from it should have their
      counts included in rec_count_sum and not count_sum.  */
   cgraph_node *itself;
+  /* True if there is a caller that has no IPA profile.  */
+  bool called_without_ipa_profile;
 };
 
 /* Initialize fields of STAT to zeroes and optionally set it up so that edges
@@ -524,10 +522,39 @@ init_caller_stats (caller_statistics *stats, cgraph_node *itself = NULL)
   stats->rec_count_sum = profile_count::zero ();
   stats->count_sum = profile_count::zero ();
   stats->n_calls = 0;
-  stats->n_hot_calls = 0;
+  stats->n_interesting_calls = 0;
   stats->n_nonrec_calls = 0;
   stats->freq_sum = 0;
   stats->itself = itself;
+  stats->called_without_ipa_profile = false;
+}
+
+/* We want to propagate across edges that may be executed, however
+   we do not want to check maybe_hot, since call itself may be cold
+   while calee contains some heavy loop which makes propagation still
+   relevant.
+
+   In particular, even edge called once may lead to significant
+   improvement.  */
+
+static bool
+cs_interesting_for_ipcp_p (cgraph_edge *e)
+{
+  /* If profile says the edge is executed, we want to optimize.  */
+  if (e->count.ipa ().nonzero_p ())
+    return true;
+  /* If local (possibly guseed or adjusted 0 profile) claims edge is
+     not executed, do not propagate.  */
+  if (!e->count.nonzero_p ())
+    return false;
+  /* If IPA profile says edge is executed zero times, but zero
+     is quality is ADJUSTED, still consider it for cloning in
+     case we have partial training.  */
+  if (e->count.ipa ().initialized_p ()
+      && opt_for_fn (e->callee->decl,flag_profile_partial_training)
+      && e->count.nonzero_p ())
+    return false;
+  return true;
 }
 
 /* Worker callback of cgraph_for_node_and_aliases accumulating statistics of
@@ -553,13 +580,18 @@ gather_caller_stats (struct cgraph_node *node, void *data)
            else
              stats->count_sum += cs->count.ipa ();
          }
+       else
+         stats->called_without_ipa_profile = true;
        stats->freq_sum += cs->sreal_frequency ();
        stats->n_calls++;
        if (stats->itself && stats->itself != cs->caller)
          stats->n_nonrec_calls++;
 
-       if (cs->maybe_hot_p ())
-         stats->n_hot_calls ++;
+       /* If profile known to be zero, we do not want to clone for performance.
+          However if call is cold, the called function may still contain
+          important hot loops.  */
+       if (cs_interesting_for_ipcp_p (cs))
+         stats->n_interesting_calls++;
       }
   return false;
 
@@ -602,26 +634,11 @@ ipcp_cloning_candidate_p (struct cgraph_node *node)
                 node->dump_name ());
       return true;
     }
-
-  /* When profile is available and function is hot, propagate into it even if
-     calls seems cold; constant propagation can improve function's speed
-     significantly.  */
-  if (stats.count_sum > profile_count::zero ()
-      && node->count.ipa ().initialized_p ())
-    {
-      if (stats.count_sum > node->count.ipa ().apply_scale (90, 100))
-       {
-         if (dump_file)
-           fprintf (dump_file, "Considering %s for cloning; "
-                    "usually called directly.\n",
-                    node->dump_name ());
-         return true;
-       }
-    }
-  if (!stats.n_hot_calls)
+  if (!stats.n_interesting_calls)
     {
       if (dump_file)
-       fprintf (dump_file, "Not considering %s for cloning; no hot calls.\n",
+       fprintf (dump_file, "Not considering %s for cloning; "
+                "no calls considered interesting by profile.\n",
                 node->dump_name ());
       return false;
     }
@@ -3369,24 +3386,29 @@ incorporate_penalties (cgraph_node *node, ipa_node_params *info,
 static bool
 good_cloning_opportunity_p (struct cgraph_node *node, sreal time_benefit,
                            sreal freq_sum, profile_count count_sum,
-                           int size_cost)
+                           int size_cost, bool called_without_ipa_profile)
 {
+  gcc_assert (count_sum.ipa () == count_sum);
   if (time_benefit == 0
       || !opt_for_fn (node->decl, flag_ipa_cp_clone)
-      || node->optimize_for_size_p ())
+      || node->optimize_for_size_p ()
+      /* If there is no call which was executed in profiling or where
+        profile is missing, we do not want to clone.  */
+      || (!called_without_ipa_profile && !count_sum.nonzero_p ()))
     return false;
 
   gcc_assert (size_cost > 0);
 
   ipa_node_params *info = ipa_node_params_sum->get (node);
   int eval_threshold = opt_for_fn (node->decl, param_ipa_cp_eval_threshold);
+  /* If we know the execution IPA execution counts, we can estimate overall
+     speedup of the program.  */
   if (count_sum.nonzero_p ())
     {
-      gcc_assert (base_count.nonzero_p ());
-      sreal factor = count_sum.probability_in (base_count).to_sreal ();
-      sreal evaluation = (time_benefit * factor) / size_cost;
+      profile_count saved_time = count_sum * time_benefit;
+      sreal evaluation = saved_time.to_sreal_scale (profile_count::one ())
+                             / size_cost;
       evaluation = incorporate_penalties (node, info, evaluation);
-      evaluation *= 1000;
 
       if (dump_file && (dump_flags & TDF_DETAILS))
        {
@@ -3394,33 +3416,46 @@ good_cloning_opportunity_p (struct cgraph_node *node, sreal time_benefit,
                   "size: %i, count_sum: ", time_benefit.to_double (),
                   size_cost);
          count_sum.dump (dump_file);
+         fprintf (dump_file, ", overall time saved: ");
+         saved_time.dump (dump_file);
          fprintf (dump_file, "%s%s) -> evaluation: %.2f, threshold: %i\n",
                 info->node_within_scc
                   ? (info->node_is_self_scc ? ", self_scc" : ", scc") : "",
                 info->node_calling_single_call ? ", single_call" : "",
                   evaluation.to_double (), eval_threshold);
        }
-
-      return evaluation.to_int () >= eval_threshold;
+      gcc_checking_assert (saved_time == saved_time.ipa ());
+      if (!maybe_hot_count_p (NULL, saved_time))
+       {
+         if (dump_file && (dump_flags & TDF_DETAILS))
+           fprintf (dump_file, "     not cloning: time saved is not hot\n");
+       }
+      /* Evaulation approximately corresponds to time saved per instruction
+        introduced.  This is likely almost always going to be true, since we
+        already checked that time saved is large enough to be considered
+        hot.  */
+      else if (evaluation.to_int () >= eval_threshold)
+       return true;
+      /* If all call sites have profile known; we know we do not want t clone.
+        If there are calls with unknown profile; try local heuristics.  */
+      if (!called_without_ipa_profile)
+       return false;
     }
-  else
-    {
-      sreal evaluation = (time_benefit * freq_sum) / size_cost;
-      evaluation = incorporate_penalties (node, info, evaluation);
-      evaluation *= 1000;
+  sreal evaluation = (time_benefit * freq_sum) / size_cost;
+  evaluation = incorporate_penalties (node, info, evaluation);
+  evaluation *= 1000;
 
-      if (dump_file && (dump_flags & TDF_DETAILS))
-       fprintf (dump_file, "     good_cloning_opportunity_p (time: %g, "
-                "size: %i, freq_sum: %g%s%s) -> evaluation: %.2f, "
-                "threshold: %i\n",
-                time_benefit.to_double (), size_cost, freq_sum.to_double (),
-                info->node_within_scc
-                  ? (info->node_is_self_scc ? ", self_scc" : ", scc") : "",
-                info->node_calling_single_call ? ", single_call" : "",
-                evaluation.to_double (), eval_threshold);
+  if (dump_file && (dump_flags & TDF_DETAILS))
+    fprintf (dump_file, "     good_cloning_opportunity_p (time: %g, "
+            "size: %i, freq_sum: %g%s%s) -> evaluation: %.2f, "
+            "threshold: %i\n",
+            time_benefit.to_double (), size_cost, freq_sum.to_double (),
+            info->node_within_scc
+              ? (info->node_is_self_scc ? ", self_scc" : ", scc") : "",
+            info->node_calling_single_call ? ", single_call" : "",
+            evaluation.to_double (), eval_threshold);
 
-      return evaluation.to_int () >= eval_threshold;
-    }
+  return evaluation.to_int () >= eval_threshold;
 }
 
 /* Grow vectors in AVALS and fill them with information about values of
@@ -3613,7 +3648,8 @@ estimate_local_effects (struct cgraph_node *node)
                     "known contexts, code not going to grow.\n");
        }
       else if (good_cloning_opportunity_p (node, time, stats.freq_sum,
-                                          stats.count_sum, size))
+                                          stats.count_sum, size,
+                                          stats.called_without_ipa_profile))
        {
          if (size + overall_size <= get_max_overall_size (node))
            {
@@ -3979,7 +4015,7 @@ value_topo_info<valtype>::propagate_effects ()
          processed_srcvals.empty ();
          for (src = val->sources; src; src = src->next)
            if (src->val
-               && src->cs->maybe_hot_p ())
+               && cs_interesting_for_ipcp_p (src->cs))
              {
                if (!processed_srcvals.add (src->val))
                  {
@@ -4024,21 +4060,6 @@ value_topo_info<valtype>::propagate_effects ()
     }
 }
 
-/* Callback for qsort to sort counts of all edges.  */
-
-static int
-compare_edge_profile_counts (const void *a, const void *b)
-{
-  const profile_count *cnt1 = (const profile_count *) a;
-  const profile_count *cnt2 = (const profile_count *) b;
-
-  if (*cnt1 < *cnt2)
-    return 1;
-  if (*cnt1 > *cnt2)
-    return -1;
-  return 0;
-}
-
 
 /* Propagate constants, polymorphic contexts and their effects from the
    summaries interprocedurally.  */
@@ -4051,10 +4072,6 @@ ipcp_propagate_stage (class ipa_topo_info *topo)
   if (dump_file)
     fprintf (dump_file, "\n Propagating constants:\n\n");
 
-  base_count = profile_count::uninitialized ();
-
-  bool compute_count_base = false;
-  unsigned base_count_pos_percent = 0;
   FOR_EACH_DEFINED_FUNCTION (node)
   {
     if (node->has_gimple_body_p ()
@@ -4071,57 +4088,8 @@ ipcp_propagate_stage (class ipa_topo_info *topo)
     ipa_size_summary *s = ipa_size_summaries->get (node);
     if (node->definition && !node->alias && s != NULL)
       overall_size += s->self_size;
-    if (node->count.ipa ().initialized_p ())
-      {
-       compute_count_base = true;
-       unsigned pos_percent = opt_for_fn (node->decl,
-                                          param_ipa_cp_profile_count_base);
-       base_count_pos_percent = MAX (base_count_pos_percent, pos_percent);
-      }
   }
 
-  if (compute_count_base)
-    {
-      auto_vec<profile_count> all_edge_counts;
-      all_edge_counts.reserve_exact (symtab->edges_count);
-      FOR_EACH_DEFINED_FUNCTION (node)
-       for (cgraph_edge *cs = node->callees; cs; cs = cs->next_callee)
-         {
-           profile_count count = cs->count.ipa ();
-           if (!count.nonzero_p ())
-             continue;
-
-           enum availability avail;
-           cgraph_node *tgt
-             = cs->callee->function_or_virtual_thunk_symbol (&avail);
-           ipa_node_params *info = ipa_node_params_sum->get (tgt);
-           if (info && info->versionable)
-             all_edge_counts.quick_push (count);
-         }
-
-      if (!all_edge_counts.is_empty ())
-       {
-         gcc_assert (base_count_pos_percent <= 100);
-         all_edge_counts.qsort (compare_edge_profile_counts);
-
-         unsigned base_count_pos
-           = ((all_edge_counts.length () * (base_count_pos_percent)) / 100);
-         base_count = all_edge_counts[base_count_pos];
-
-         if (dump_file)
-           {
-             fprintf (dump_file, "\nSelected base_count from %u edges at "
-                      "position %u, arriving at: ", all_edge_counts.length (),
-                      base_count_pos);
-             base_count.dump (dump_file);
-             fprintf (dump_file, "\n");
-           }
-       }
-      else if (dump_file)
-       fprintf (dump_file, "\nNo candidates with non-zero call count found, "
-                "continuing as if without profile feedback.\n");
-    }
-
   orig_overall_size = overall_size;
 
   if (dump_file)
@@ -4383,15 +4351,17 @@ static bool
 get_info_about_necessary_edges (ipcp_value<valtype> *val, cgraph_node *dest,
                                sreal *freq_sum, int *caller_count,
                                profile_count *rec_count_sum,
-                               profile_count *nonrec_count_sum)
+                               profile_count *nonrec_count_sum,
+                               bool *called_without_ipa_profile)
 {
   ipcp_value_source<valtype> *src;
   sreal freq = 0;
   int count = 0;
   profile_count rec_cnt = profile_count::zero ();
   profile_count nonrec_cnt = profile_count::zero ();
-  bool hot = false;
+  bool interesting = false;
   bool non_self_recursive = false;
+  *called_without_ipa_profile = false;
 
   for (src = val->sources; src; src = src->next)
     {
@@ -4402,15 +4372,19 @@ get_info_about_necessary_edges (ipcp_value<valtype> *val, cgraph_node *dest,
            {
              count++;
              freq += cs->sreal_frequency ();
-             hot |= cs->maybe_hot_p ();
+             interesting |= cs_interesting_for_ipcp_p (cs);
              if (cs->caller != dest)
                {
                  non_self_recursive = true;
                  if (cs->count.ipa ().initialized_p ())
                    rec_cnt += cs->count.ipa ();
+                 else
+                   *called_without_ipa_profile = true;
                }
              else if (cs->count.ipa ().initialized_p ())
                nonrec_cnt += cs->count.ipa ();
+             else
+               *called_without_ipa_profile = true;
            }
          cs = get_next_cgraph_edge_clone (cs);
        }
@@ -4426,19 +4400,7 @@ get_info_about_necessary_edges (ipcp_value<valtype> *val, cgraph_node *dest,
   *rec_count_sum = rec_cnt;
   *nonrec_count_sum = nonrec_cnt;
 
-  if (!hot && ipa_node_params_sum->get (dest)->node_within_scc)
-    {
-      struct cgraph_edge *cs;
-
-      /* Cold non-SCC source edge could trigger hot recursive execution of
-        function. Consider the case as hot and rely on following cost model
-        computation to further select right one.  */
-      for (cs = dest->callers; cs; cs = cs->next_caller)
-       if (cs->caller == dest && cs->maybe_hot_p ())
-         return true;
-    }
-
-  return hot;
+  return interesting;
 }
 
 /* Given a NODE, and a set of its CALLERS, try to adjust order of the callers
@@ -5928,6 +5890,7 @@ decide_about_value (struct cgraph_node *node, int index, HOST_WIDE_INT offset,
   sreal freq_sum;
   profile_count count_sum, rec_count_sum;
   vec<cgraph_edge *> callers;
+  bool called_without_ipa_profile;
 
   if (val->spec_node)
     {
@@ -5943,7 +5906,8 @@ decide_about_value (struct cgraph_node *node, int index, HOST_WIDE_INT offset,
       return false;
     }
   else if (!get_info_about_necessary_edges (val, node, &freq_sum, &caller_count,
-                                           &rec_count_sum, &count_sum))
+                                           &rec_count_sum, &count_sum,
+                                           &called_without_ipa_profile))
     return false;
 
   if (!dbg_cnt (ipa_cp_values))
@@ -5980,9 +5944,11 @@ decide_about_value (struct cgraph_node *node, int index, HOST_WIDE_INT offset,
 
   if (!good_cloning_opportunity_p (node, val->local_time_benefit,
                                   freq_sum, count_sum,
-                                  val->local_size_cost)
+                                  val->local_size_cost,
+                                  called_without_ipa_profile)
       && !good_cloning_opportunity_p (node, val->prop_time_benefit,
-                                     freq_sum, count_sum, val->prop_size_cost))
+                                     freq_sum, count_sum, val->prop_size_cost,
+                                     called_without_ipa_profile))
     return false;
 
   if (dump_file)
@@ -6564,7 +6530,6 @@ make_pass_ipa_cp (gcc::context *ctxt)
 void
 ipa_cp_cc_finalize (void)
 {
-  base_count = profile_count::uninitialized ();
   overall_size = 0;
   orig_overall_size = 0;
   ipcp_free_transformation_sum ();
diff --git a/gcc/params.opt b/gcc/params.opt
index a2b606fb9178..ef19051286be 100644
--- a/gcc/params.opt
+++ b/gcc/params.opt
@@ -273,10 +273,6 @@ The size of translation unit that IPA-CP pass considers large.
 Common Joined UInteger Var(param_ipa_cp_value_list_size) Init(8) Param Optimization
 Maximum size of a list of values associated with each parameter for interprocedural constant propagation.
 
--param=ipa-cp-profile-count-base=
-Common Joined UInteger Var(param_ipa_cp_profile_count_base) Init(10) IntegerRange(0, 100) Param Optimization
-When using profile feedback, use the edge at this percentage position in frequency histogram as the bases for IPA-CP heuristics.
-
 -param=ipa-jump-function-lookups=
 Common Joined UInteger Var(param_ipa_jump_function_lookups) Init(8) Param Optimization
 Maximum number of statements visited during jump function offset discovery.
diff --git a/gcc/profile-count.cc b/gcc/profile-count.cc
index 8b9d8e18c51c..374f06f4c083 100644
--- a/gcc/profile-count.cc
+++ b/gcc/profile-count.cc
@@ -519,3 +519,26 @@ profile_probability::pow (int n) const
     }
   return ret;
 }
+profile_count
+profile_count::operator* (const sreal &num) const
+{
+  if (m_val == 0)
+    return *this;
+  if (!initialized_p ())
+    return uninitialized ();
+  sreal scaled = num * m_val;
+  gcc_checking_assert (scaled >= 0);
+  profile_count ret;
+  if (m_val > max_count)
+    ret.m_val = max_count;
+  else
+    ret.m_val = scaled.to_nearest_int ();
+  ret.m_quality = MIN (m_quality, ADJUSTED);
+  return ret;
+}
+
+profile_count
+profile_count::operator*= (const sreal &num)
+{
+  return *this * num;
+}
diff --git a/gcc/profile-count.h b/gcc/profile-count.h
index 015aee981ca4..0e79fd241b51 100644
--- a/gcc/profile-count.h
+++ b/gcc/profile-count.h
@@ -1061,6 +1061,9 @@ public:
       return *this;
     }
 
+  profile_count operator* (const sreal &num) const;
+  profile_count operator*= (const sreal &num);
+
   profile_count operator/ (int64_t den) const
     {
       return apply_scale (1, den);
diff --git a/gcc/testsuite/gcc.dg/ipa/ipa-clone-4.c b/gcc/testsuite/gcc.dg/ipa/ipa-clone-4.c
new file mode 100644
index 000000000000..bf74e64c673b
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/ipa/ipa-clone-4.c
@@ -0,0 +1,30 @@
+/* { dg-options "-O3 -fdump-ipa-cp" } */
+__attribute__ ((used))
+int a[1000];
+
+__attribute__ ((noinline))
+void
+test2(int sz)
+{
+  for (int i = 0; i < sz; i++)
+         a[i]++;
+  asm volatile (""::"m"(a));
+}
+
+__attribute__ ((noinline))
+void
+test1 (int sz)
+{
+  for (int i = 0; i < 1000; i++)
+         test2(sz);
+}
+int main()
+{
+       test1(1000);
+       return 0;
+}
+/* We should clone test1 and test2 for constant 1000.
+   In the past we did not do this since we did not clone for edges that are not hot
+   and call main->test1 is not considered hot since it is executed just once.  */
+/* { dg-final { scan-ipa-dump-times "Creating a specialized node of test1" 1 "cp"} } */
+/* { dg-final { scan-ipa-dump-times "Creating a specialized node of test2" 1 "cp"} } */
diff --git a/gcc/testsuite/gcc.dg/tree-prof/ipa-cp-1.c b/gcc/testsuite/gcc.dg/tree-prof/ipa-cp-1.c
new file mode 100644
index 000000000000..ab6a7f7211f5
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-prof/ipa-cp-1.c
@@ -0,0 +1,30 @@
+/* { dg-options "-O2 -fdump-ipa-cp" } */
+__attribute__ ((used))
+int a[1000];
+
+__attribute__ ((noinline))
+void
+test2(int sz)
+{
+  for (int i = 0; i < sz; i++)
+         a[i]++;
+  asm volatile (""::"m"(a));
+}
+
+__attribute__ ((noinline))
+void
+test1 (int sz)
+{
+  for (int i = 0; i < 1000; i++)
+         test2(sz);
+}
+int main()
+{
+       test1(1000);
+       return 0;
+}
+/* We should clone test1 and test2 for constant 1000.
+   In the past we did not do this since we did not clone for edges that are not hot
+   and call main->test1 is not considered hot since it is executed just once.  */
+/* { dg-final-use { scan-ipa-dump-times "Creating a specialized node of test1" 1 "cp"} } */
+/* { dg-final-use { scan-ipa-dump-times "Creating a specialized node of test2" 1 "cp"} } */
