Hi Honza,
The attached patch enables time profile based reordering with AutoFDO with
-fauto-profile -fprofile-reorder-functions, by mapping timestamps obtained from 
perf
into node->tp_first_run, and is based on top of Dhruv's sourcefile tracking 
patch:
https://gcc.gnu.org/pipermail/gcc-patches/2025-September/694800.html

The rationale for doing this is:
(1) GCC already implements time-profile function reordering, the patch enables
it with AutoFDO.
(2) While time profile ordering is primarily meant for optimizing startup time,
we've also observed good effects on code-locality for large internal workloads.
(3) Possibly useful for function reordering when accurate profile annotation is
hard with AutoFDO -- For eg, if branch samples are missing (due to absence of
LBR like structure).

On AutoFDO tools side, I have a patch that extends gcov to emit 64-bit perf 
timestamp that
records first execution of function, which loosely corresponds to PGO's 
time_profile counter.
The timestamp is stored adjacent to head field in toplevel function info.
I will post a patch for this shortly on AutoFDO tools upstream repo.

On GCC side, the patch makes the following changes:

(1) Changes to auto-profile pass:
The patch adds a new field timestamp to function_instance,
and populates it in read_function_instance.

It maintains a new timestamp_info_map from timestamp -> <name, tp_first_run>,
which maps timestamps sorted in ascending order to (1..N), so lowest ordered
timestamp is mapped to 1 and so on. The rationale for this is that
timestamps are 64-bit integers, and we don't need the full 64-bit range
for ordering by tp_first_run.

During annotation, the timestamp associated with function_instance is looked up
in timestamp_info_map, and corresponding mapped value is assigned
to node->tp_first_run.

(2) Handling clones:
Currently, for clones not registered in call graph before auto-profile pass,
the tp_first_run field is copied from original function, when the clone is 
created.
However that may not correspond to the actual order of functions.

For eg, if we have two profiled clones of foo:
foo.constprop.1, foo.constprop.2

both will get same value for tp_first_run as foo->tp_first_run, which might
not correspond to time profile order.

To address this, the patch introduces a new IPA pass ipa_adjust_tp_first_run,
that streams <clone name, tp_first_run> from timestamp_info_map during LGEN, and
during WPA reads it, and sets clone's tp_first_run field accordingly.
The pass is placed pretty late (just before locality_cloning), by that
point clones would be registered in the call graph.

Dhruv's sourcefile tracking patch already handles LTO privatized functions.
The patch adds a (temporary) workaround for functions with mismatched/empty 
filenames from gcov,
to avoid getting dropped in afdo_annotate_cfg by iterating thru all filenames 
in afdo_string_table if
get_function_instance_by_decl fails to find function_instance with lbasename 
(DECL_SOURCE_FILE (decl)).

(3) Grouping profiled functions together in as few partitions as possible 
(preferably single).
The patch places profiled functions in time profile order together in as
few paritions as possible to get better advantage of code locality.
Unlike PGO, where every instrumented function gets a time profile counter,
with AutoFDO, the sampled functions are a fraction of the total executed ones.
Similarly, in default_function_section, it overrides hot/cold partitioning so
that grouping of profiled functions isn't disrupted.

(4) Option to disable profile driven opts.
The patch adds option -fauto-profile-reorder-only which only enables 
time-profile reordering with
AutoFDO (and disables profile driven opts):
(a) Useful as a debugging aid to isolate regression to either function 
reordering or profile driven opts.
(b) For our use case, it's also seemingly useful as a stopgap measure to avoid 
regressions with AutoFDO
profile driven opts, due to issues with profile quality obtained with merging 
of SPE and non SPE profiles.
We're actively working on resolving this.
(c) Possibly useful for architectures which do not support branch sampling.
The option is disabled by default.

Ideally, I would like to make it a param (and not user facing option), but I am 
not able to control enabling/disabling options in opts.cc:common_handle_option
based on param value, will investigate this further.

* Results

On one large interal workload, the patch (along with sourcefile tracking 
patch), gives an uplift of 32.63% compared to LTO,
and 8.07% compared to LTO + AutoFDO trunk, and for another workload it gives an 
uplift of 15.31% compared to LTO, and 7.76% compared to LTO + AutoFDO trunk.
I will try benchmarking with SPEC2017.

Will be grateful for suggestions on how to proceed further.

Signed-off-by: Prathamesh Kulkarni <[email protected]>

Thanks,
Prathamesh
[RFC] Enable time profile function reordering with AutoFDO

The patch enables time profile based reordering with AutoFDO with
-fauto-profile -fprofile-reorder-functions, by mapping timestamps obtained from 
perf
into node->tp_first_run. 

Signed-off-by: Prathamesh Kulkarni <[email protected]>

gcc/ChangeLog:

        * auto-profile.cc: Include data-streamer.h, and lto-streamer.h
        (string_table::filenames): New method.
        (function_instance::timestamp_): New member.
        (function_instance::timestamp): New accessor for timestamp_ member.
        (autofdo::timestamp_info_map): New std::map.
        (function_instance::function_instance): Add argument timestamp and set
        it to member timestamp_.
        (function_instance::read_function_instance): Adjust prototype and read
        timestamp.
        (autofdo_source_profile::offline_unrealized_inlines): Put a check for
        in_map before removing function instance.
        (autofdo_source_profile::get_function_instance_by_decl): New argument
        filename with default value NULL.
        (autofdo_source_profile::read): Populate timestamp_info_map.
        (afdo_annotate_cfg): Assign node->tp_first_run based on
        timestamp_info_map and bail out of annotation if
        flag_auto_profile_reorder_only is enabled.
        (clone_function_p): New function.
        (clone_timestamp_map): New std::map.
        (adjust_tp_first_run_generate_summary): New function.
        (adjust_tp_first_run_read_summary): Likewise.
        (adjust_tp_first_run_write_summary): Likewise.
        (adjust_tp_first_run_read_summary): Likewise.
        (adjust_tp_first_run): Likewise.
        (pass_ipa_adjust_tp_first_run): New pass.
        (make_pass_ipa_adjust_tp_first_run): New function.
        * common.opt: New option -fauto-profile-reorder-only.
        * ipa-locality-cloning.cc (partition_by_timestamps): New function.
        (locality_partition_and_clone): Call partition_by_timestamps.
        * lto-section-in.cc: Add entry for ipa_adjust_tp_first_run. 
        * lto-streamer.h (enum lto_section_type): New member
        LTO_section_ipa_adjust_tp_first_run.
        * opts.cc (common_handle_option): Control enabling of profile driven
        opts based on flag_auto_profile_reorder_only.
        * passes.def: Add entry for pass ipa_adjust_tp_first_run.
        * tree-pass.h (make_pass_ipa_adjust_tp_first_run): Declare.
        * varasm.cc (default_function_section): Bail out if -fauto-profile
        -fprofile-reorder-functions is enabled and node->tp_first_run > 0.

gcc/lto/ChangeLog:

        * lto-partition.cc (create_partition): Move higher up in the file.
        (partition_by_timestamps): New function.
        (lto_balanced_map): Call partition_by_timestamps if -fauto-profile
        -fprofile-reorder-functions is passed and noreroder is empty.

diff --git a/gcc/auto-profile.cc b/gcc/auto-profile.cc
index ae604762165..58e583d5b87 100644
--- a/gcc/auto-profile.cc
+++ b/gcc/auto-profile.cc
@@ -54,6 +54,8 @@ along with GCC; see the file COPYING3.  If not see
 #include "tree-pretty-print.h"
 #include "gimple-pretty-print.h"
 #include "output.h"
+#include "data-streamer.h"
+#include "lto-streamer.h"
 
 /* The following routines implements AutoFDO optimization.
 
@@ -263,6 +265,8 @@ public:
 
   /* Return cgraph node corresponding to given name index.  */
   cgraph_node *get_cgraph_node (int);
+
+  const string_vector& filenames () { return filenames_; }
 private:
   typedef std::map<const char *, unsigned, string_compare> string_index_map;
   typedef std::map<const char *, auto_vec<unsigned>, string_compare>
@@ -294,7 +298,7 @@ public:
      too. STACK is used to track the recursive creation process.  */
   static function_instance *
   read_function_instance (function_instance_stack *stack,
-                          gcov_type head_count);
+                         gcov_type head_count, bool toplevel = true);
 
   /* Recursively deallocate all callsites (nested function_instances).  */
   ~function_instance ();
@@ -329,6 +333,12 @@ public:
     return head_count_;
   }
 
+  gcov_type
+  timestamp () const
+  {
+    return timestamp_;
+  }
+
   /* Traverse callsites of the current function_instance to find one at the
      location of LINENO and callee name represented in DECL.
      LOCATION should match LINENO and is used to output diagnostics.  */
@@ -489,11 +499,13 @@ private:
   /* Map from callsite to callee function_instance.  */
   typedef std::map<callsite, function_instance *> callsite_map;
 
-  function_instance (unsigned name, unsigned file_name, gcov_type head_count)
+  function_instance (unsigned name, unsigned file_name, gcov_type head_count,
+                    gcov_type timestamp)
     : name_ (name), file_name_ (file_name), total_count_ (0),
-      head_count_ (head_count), removed_icall_target_ (false),
-      realized_ (false), in_worklist_ (false), inlined_to_ (NULL),
-      location_ (UNKNOWN_LOCATION), call_location_ (UNKNOWN_LOCATION)
+      head_count_ (head_count), timestamp_ (timestamp),
+      removed_icall_target_ (false), realized_ (false), in_worklist_ (false),
+      inlined_to_ (NULL), location_ (UNKNOWN_LOCATION),
+      call_location_ (UNKNOWN_LOCATION)
   {
   }
 
@@ -512,6 +524,9 @@ private:
   /* Entry BB's sample count.  */
   gcov_type head_count_;
 
+  /* perf timestamp associated with first execution of function.  */
+  gcov_type timestamp_;
+
   /* Map from callsite location to callee function_instance.  */
   callsite_map callsites;
 
@@ -559,7 +574,7 @@ public:
   ~autofdo_source_profile ();
 
   /* For a given DECL, returns the top-level function_instance.  */
-  function_instance *get_function_instance_by_decl (tree decl) const;
+  function_instance *get_function_instance_by_decl (tree decl, const char * = 
NULL) const;
 
   /* For a given name index (and original name), returns the top-level
    * function_instance.  */
@@ -632,6 +647,13 @@ static autofdo_source_profile *afdo_source_profile;
 /* gcov_summary structure to store the profile_info.  */
 static gcov_summary *afdo_profile_info;
 
+/* Map from timestamp -> <name, tp_first_run>.
+
+The purpose of this map is to map 64-bit timestamp values to (1..N),
+sorted by ascending order of timestamps and assign that to node->tp_first_run,
+since we don't need the full 64-bit range.  */
+static std::map<gcov_type, std::pair<int, int>> timestamp_info_map;
+
 /* Scaling factor for afdo data.  Compared to normal profile
    AFDO profile counts are much lower, depending on sampling
    frequency.  We scale data up to reudce effects of roundoff
@@ -2477,7 +2499,7 @@ autofdo_source_profile::offline_unrealized_inlines ()
          delete f;
        }
       /* If function is not in symbol table, remove it.  */
-      else if (!f->realized_p ())
+      else if (in_map && !f->realized_p ())
        {
          if (dump_file)
            fprintf (dump_file, "Removing optimized out function %s\n",
@@ -2502,6 +2524,7 @@ autofdo_source_profile::offline_unrealized_inlines ()
 /* function instance profile format:
 
    ENTRY_COUNT: 8 bytes
+   TIMESTAMP: 8 bytes (only for toplevel symbols)
    NAME_INDEX: 4 bytes
    NUM_POS_COUNTS: 4 bytes
    NUM_CALLSITES: 4 byte
@@ -2528,14 +2551,17 @@ autofdo_source_profile::offline_unrealized_inlines ()
 
 function_instance *
 function_instance::read_function_instance (function_instance_stack *stack,
-                                          gcov_type head_count)
+                                          gcov_type head_count, bool toplevel)
 {
+  gcov_type_unsigned timestamp = 0;
+  if (toplevel)
+    timestamp = (gcov_type_unsigned) gcov_read_counter ();
   unsigned name = gcov_read_unsigned ();
   unsigned num_pos_counts = gcov_read_unsigned ();
   unsigned num_callsites = gcov_read_unsigned ();
   function_instance *s
     = new function_instance (name, afdo_string_table->get_filename_idx (name),
-                            head_count);
+                            head_count, timestamp);
   if (!stack->is_empty ())
     s->set_inlined_to (stack->last ());
   stack->safe_push (s);
@@ -2563,7 +2589,7 @@ function_instance::read_function_instance 
(function_instance_stack *stack,
     {
       unsigned offset = gcov_read_unsigned ();
       function_instance *callee_function_instance
-          = read_function_instance (stack, -1);
+       = read_function_instance (stack, -1, false);
       s->callsites[std::make_pair (offset, callee_function_instance->name ())]
           = callee_function_instance;
     }
@@ -2585,9 +2611,11 @@ autofdo_source_profile::~autofdo_source_profile ()
 /* For a given DECL, returns the top-level function_instance.  */
 
 function_instance *
-autofdo_source_profile::get_function_instance_by_decl (tree decl) const
+autofdo_source_profile::get_function_instance_by_decl (tree decl, const char 
*filename) const
 {
-  const char *filename = lbasename (DECL_SOURCE_FILE (decl));
+  if (!filename)
+    filename = lbasename (DECL_SOURCE_FILE (decl));
+
   int index = afdo_string_table->get_index_by_decl (decl);
   if (index == -1)
     return NULL;
@@ -2829,7 +2857,21 @@ autofdo_source_profile::read ()
        fatal_error (UNKNOWN_LOCATION,
                     "auto-profile contains duplicated function instance %s",
                     afdo_string_table->get_name (s->name ()));
+
+      timestamp_info_map.insert({s->timestamp (),
+                                std::make_pair (s->name (), 0)});
     }
+
+  /* timestamp_info_map is std::map with timestamp as key,
+     so it's already sorted in ascending order wrt timestamps.
+     This loop maps function with lowest timestamp to 1, and so on.
+     In afdo_annotate_cfg, node->tp_first_run is then set to corresponding
+     tp_first_run value.  */
+
+  int tp_first_run = 1;
+  for (auto &p : timestamp_info_map)
+    p.second.second = tp_first_run++;
+
   int hot_frac = param_hot_bb_count_fraction;
   /* Scale up the profile, but leave some bits in case some counts gets
      bigger than sum_max eventually.  */
@@ -4079,6 +4121,17 @@ afdo_annotate_cfg (void)
       = afdo_source_profile->get_function_instance_by_decl (
           current_function_decl);
 
+  /* FIXME: This is a workaround for sourcefile tracking, if afdo_string_table
+     ends up with empty filename or incorrect filename for the function and
+     should be removed once issues with sourcefile tracking get fixed.  */
+  if (s == NULL)
+    for (unsigned i = 0; i < afdo_string_table->filenames ().length (); i++)
+      {
+       s = afdo_source_profile->get_function_instance_by_decl 
(current_function_decl, afdo_string_table->filenames()[i]);
+       if (s)
+         break;
+      }
+
   if (s == NULL)
     {
       if (dump_file)
@@ -4087,7 +4140,8 @@ afdo_annotate_cfg (void)
       /* create_gcov only dumps symbols with some samples in them.
         This means that we get nonempty zero_bbs only if some
         nonzero counts in profile were not matched with statements.  */
-      if (!flag_profile_partial_training)
+      if (!flag_profile_partial_training
+         && !flag_auto_profile_reorder_only)
        {
          FOR_ALL_BB_FN (bb, cfun)
            if (bb->count.quality () == GUESSED_LOCAL)
@@ -4097,6 +4151,20 @@ afdo_annotate_cfg (void)
       return;
     }
 
+  if (auto it = timestamp_info_map.find (s->timestamp ());
+      it != timestamp_info_map.end ())
+    {
+      cgraph_node *node = cgraph_node::get (current_function_decl);
+      node->tp_first_run = it->second.second;
+
+      if (dump_file)
+       fprintf (dump_file, "Setting %s->tp_first_run to %d\n",
+                node->asm_name (), node->tp_first_run);
+    }
+
+  if (flag_auto_profile_reorder_only)
+    return;
+
   calculate_dominance_info (CDI_POST_DOMINATORS);
   calculate_dominance_info (CDI_DOMINATORS);
   loop_optimizer_init (0);
@@ -4496,3 +4564,182 @@ make_pass_ipa_auto_profile_offline (gcc::context *ctxt)
 {
   return new pass_ipa_auto_profile_offline (ctxt);
 }
+
+/* Map from name -> tp_first_run for clones. We need this for initiailizing 
tp_first_run
+   of clones created after auto-profile pass.  */
+
+static std::map<const char *, int, autofdo::string_compare> 
clone_timestamp_map;
+
+/* Check if name is a clone function name. LTO privatized clones should already
+   be handled with sourcefile tracking.  */
+
+static bool
+clone_function_p (const char *name)
+{
+  return strchr (name, '.') && !strstr (name, "lto_priv");
+}
+
+/* Gather <name, tp_first_run> value from autofdo::timestamp_info_map.  */
+
+static void
+adjust_tp_first_run_generate_summary (void)
+{
+  for (const auto &p : autofdo::timestamp_info_map)
+    {
+      gcov_type timestamp = p.first;
+      int name_id = p.second.first;
+      int tp_first_run = p.second.second;
+
+      if (const char *name = autofdo::afdo_string_table->get_name (name_id);
+         name && clone_function_p (name))
+       clone_timestamp_map.insert({xstrdup (name), tp_first_run});
+    }
+}
+
+/* Stream out <name, tp_first_run>.  */
+
+static void
+adjust_tp_first_run_write_summary (void)
+{
+  struct output_block *ob
+    = create_output_block (LTO_section_ipa_adjust_tp_first_run);
+
+  streamer_write_uhwi (ob, clone_timestamp_map.size ());
+  for (const auto &p : clone_timestamp_map)
+    {
+      streamer_write_string (ob, ob->main_stream, p.first, true);
+      streamer_write_hwi (ob, p.second);
+    }
+
+  produce_asm (ob);
+  destroy_output_block (ob);
+}
+
+/* Stream in <name, tp_first_run> and put it in clone_timestamp_map.  */
+
+static void
+adjust_tp_first_run_read_summary (void)
+{
+  struct lto_file_decl_data **file_data_vec = lto_get_file_decl_data ();
+  struct lto_file_decl_data *file_data;
+  unsigned j = 0;
+
+  while ((file_data = file_data_vec[j++]))
+    {
+      size_t len;
+      const char *data
+       = lto_get_summary_section_data (file_data,
+                                       LTO_section_ipa_adjust_tp_first_run,
+                                       &len);
+      if (!data)
+       continue;
+
+      const struct lto_function_header *header
+       = (const struct lto_function_header *) data;
+      const int cfg_offset = sizeof (struct lto_function_header);
+      const int main_offset = cfg_offset + header->cfg_size;
+      const int string_offset = main_offset + header->main_size;
+      struct data_in *data_in;
+
+      lto_input_block ib ((const char *) data + main_offset,
+                         header->main_size, file_data);
+
+      data_in
+       = lto_data_in_create (file_data, (const char *) data + string_offset,
+                             header->string_size, vNULL);
+
+      unsigned num_clones = streamer_read_uhwi (&ib);
+      for (unsigned i = 0; i < num_clones; i++)
+       {
+         const char *fn_name = streamer_read_string (data_in, &ib);
+         int tp_first_run = streamer_read_hwi (&ib);
+         clone_timestamp_map.insert ({xstrdup (fn_name), tp_first_run});
+       }
+
+      lto_free_section_data (file_data, LTO_section_ipa_adjust_tp_first_run,
+                            NULL, data, len);
+      lto_data_in_delete (data_in);
+    }
+}
+
+/* A late IPA pass for handling clones.
+   During auto-profile pass, for clones that aren't registered in call graph,
+   stream <name, tp_first_run> from timestamp_info_map during LGEN.
+   And during WPA, stream back the info, and assign the field value.
+   The pass is placed sufficiently late (just before locality_cloning), that
+   it should have clones registered in the call graph.  */
+
+static unsigned
+adjust_tp_first_run ()
+{
+  for (const auto &p : clone_timestamp_map)
+    if (cgraph_node *node
+       = cgraph_node::get_for_asmname (get_identifier (p.first)))
+      {
+       node->tp_first_run = p.second;
+       if (dump_file)
+         fprintf (dump_file, "Setting %s->tp_first_run to %d\n",
+                  node->asm_name (), node->tp_first_run);
+      }
+
+  for (const auto &p : clone_timestamp_map)
+    free ((void *) p.first);
+  clone_timestamp_map.clear ();
+
+  return 0;
+}
+
+namespace {
+
+const pass_data pass_data_ipa_adjust_tp_first_run =
+{
+  IPA_PASS, /* type */
+  "adjust-tp_first_run", /* name */
+  OPTGROUP_NONE, /* optinfo_flags */
+  TV_CGRAPHOPT, /* tv_id */
+  0, /* properties_required */
+  0, /* properties_provided */
+  0, /* properties_destroyed */
+  0, /* todo_flags_start */
+  0, /* todo_flags_finish */
+};
+
+class pass_ipa_adjust_tp_first_run : public ipa_opt_pass_d
+{
+public:
+  pass_ipa_adjust_tp_first_run (gcc::context *ctxt)
+    : ipa_opt_pass_d (pass_data_ipa_adjust_tp_first_run, ctxt,
+                     adjust_tp_first_run_generate_summary, /* generate_summary 
*/
+                     adjust_tp_first_run_write_summary, /* write_summary */
+                     adjust_tp_first_run_read_summary, /* read_summary */
+                     NULL, /* write_optimization_summary */
+                     NULL, /* read_optimization_summary */
+                     NULL, /* stmt_fixup */
+                        0, /* function_transform_todo_flags_start */
+                     NULL, /* function_transform */
+                     NULL) /* variable_transform */
+  {}
+
+  /* opt_pass methods: */
+  bool
+  gate (function *) final override
+  {
+    return flag_auto_profile;
+  }
+
+  unsigned
+  execute (function *) final override
+  {
+    return adjust_tp_first_run ();
+  }
+
+}; // class pass_ipa_adjust_tp_first_run
+
+} // anon namespace
+
+ipa_opt_pass_d *
+make_pass_ipa_adjust_tp_first_run (gcc::context *ctxt)
+{
+  return new pass_ipa_adjust_tp_first_run (ctxt);
+}
+
diff --git a/gcc/common.opt b/gcc/common.opt
index f6d93dc05fb..d7c028f87fe 100644
--- a/gcc/common.opt
+++ b/gcc/common.opt
@@ -1203,6 +1203,10 @@ fauto-profile-inlining
 Common Var(flag_auto_profile_inlining) Init(1) Optimization
 Perform inlining using auto-profile.
 
+fauto-profile-reorder-only
+Common Var(flag_auto_profile_reorder_only) Init(0) Optimization
+Perform only function based reordering with fauto-profile.
+
 ; -fcheck-bounds causes gcc to generate array bounds checks.
 ; For C, C++ and ObjC: defaults off.
 ; For Java: defaults to on.
diff --git a/gcc/ipa-locality-cloning.cc b/gcc/ipa-locality-cloning.cc
index 2684046bd2d..1653ea7b961 100644
--- a/gcc/ipa-locality-cloning.cc
+++ b/gcc/ipa-locality-cloning.cc
@@ -994,6 +994,50 @@ locality_determine_static_order (auto_vec<locality_order 
*> *order)
   return true;
 }
 
+/* Similar to lto-partition.cc:partition_by_timestamp.s  */
+
+static locality_partition
+partition_by_timestamps (int64_t max_partition_size, int& npartitions)
+{
+  vec<cgraph_node *> cnodes = vNULL;
+  cgraph_node *node;
+  FOR_EACH_DEFINED_FUNCTION (node)
+    if (node->tp_first_run > 0
+       && node->get_partitioning_class () == SYMBOL_PARTITION
+       && ipa_size_summaries->get (node))
+      cnodes.safe_push (node);
+  cnodes.qsort (tp_first_run_node_cmp);
+
+  unsigned n_partitioned_symbols = 0;
+
+  locality_partition partition = NULL;
+  unsigned i;
+  FOR_EACH_VEC_ELT (cnodes, i, node)
+    {
+      if (node_partitioned_p (node))
+       continue;
+      ipa_size_summary *size_summary = ipa_size_summaries->get (node);
+
+      if (partition == NULL
+         || partition->insns + size_summary->size > max_partition_size)
+       partition = create_partition (npartitions);
+
+      add_node_to_partition (partition, node);
+      n_partitioned_symbols++;
+    }
+
+  if (dump_file)
+    {
+      fprintf (dump_file,
+              "partition_by_timestamps: Total partitions: %d\n", npartitions);
+      fprintf (dump_file,
+              "partition_by_timestamps: Number of partitioned symbols: %d\n",
+              n_partitioned_symbols);
+    }
+
+  return partition;
+}
+
 /* Partitioning for code locality.
    1. Create and sort callchains.  If PGO is available, use real profile
    counts.  Otherwise, use a set of heuristics to sort the callchains.
@@ -1007,7 +1051,7 @@ locality_partition_and_clone (int 
max_locality_partition_size,
                              lto_locality_cloning_model cloning_model,
                              int freq_denominator, int size)
 {
-  locality_partition partition;
+  locality_partition partition = NULL;
   int npartitions = 0;
 
   auto_vec<locality_order *> order;
@@ -1040,7 +1084,11 @@ locality_partition_and_clone (int 
max_locality_partition_size,
   int64_t partition_size
     = max_locality_partition_size
        ? max_locality_partition_size : param_max_partition_size;
-  partition = create_partition (npartitions);
+
+  if (flag_auto_profile && flag_profile_reorder_functions)
+    partition = partition_by_timestamps (partition_size, npartitions);
+  if (!partition)
+    partition = create_partition (npartitions);
 
   for (unsigned i = 0; i < order.length (); i++)
     {
diff --git a/gcc/lto-section-in.cc b/gcc/lto-section-in.cc
index 1dd9520137a..2e2e30ea400 100644
--- a/gcc/lto-section-in.cc
+++ b/gcc/lto-section-in.cc
@@ -57,6 +57,7 @@ const char *lto_section_name[LTO_N_SECTION_TYPES] =
   "ipa_sra",
   "odr_types",
   "ipa_modref",
+  "ipa_adjust_tp_first_run",
 };
 
 /* Hooks so that the ipa passes can call into the lto front end to get
diff --git a/gcc/lto-streamer.h b/gcc/lto-streamer.h
index 4b7209e3d82..242579918b0 100644
--- a/gcc/lto-streamer.h
+++ b/gcc/lto-streamer.h
@@ -225,6 +225,7 @@ enum lto_section_type
   LTO_section_ipa_sra,
   LTO_section_odr_types,
   LTO_section_ipa_modref,
+  LTO_section_ipa_adjust_tp_first_run,
   LTO_N_SECTION_TYPES          /* Must be last.  */
 };
 
diff --git a/gcc/lto/lto-partition.cc b/gcc/lto/lto-partition.cc
index c7e69ee3161..c7937107ad7 100644
--- a/gcc/lto/lto-partition.cc
+++ b/gcc/lto/lto-partition.cc
@@ -87,6 +87,15 @@ new_partition (const char *name)
   return part;
 }
 
+/* Create and return the created partition of name NAME.  */
+
+static ltrans_partition
+create_partition (int &npartitions, const char *name)
+{
+  npartitions++;
+  return new_partition (name);
+}
+
 /* Free memory used by ltrans partition.
    Encoder can be kept to be freed after streaming.  */
 static void
@@ -1017,6 +1026,50 @@ lto_cache_map (int n_lto_partitions, int 
max_partition_size)
   partitioner.apply (partitions, n_lto_partitions);
 }
 
+/* Place symbols that are profiled in as few partitions as possible.  */
+
+static ltrans_partition
+partition_by_timestamps (int max_partition_size, int &npartitions)
+{
+  vec<cgraph_node *> cnodes = vNULL;
+  cgraph_node *node;
+  FOR_EACH_DEFINED_FUNCTION (node)
+    if (node->tp_first_run > 0
+       && node->get_partitioning_class () == SYMBOL_PARTITION
+       && ipa_size_summaries->get (node))
+      cnodes.safe_push (node);
+  cnodes.qsort (tp_first_run_node_cmp);
+
+  unsigned n_partitioned_symbols = 0;
+
+  ltrans_partition partition = NULL;
+  unsigned i;
+  FOR_EACH_VEC_ELT (cnodes, i, node)
+    {
+      if (symbol_partitioned_p (node))
+       continue;
+
+      ipa_size_summary *size_summary = ipa_size_summaries->get (node);
+      if (partition == NULL
+         || partition->insns + size_summary->size > max_partition_size)
+       partition = create_partition (npartitions, "");
+
+      add_symbol_to_partition (partition, node);
+      n_partitioned_symbols++;
+    }
+
+  if (dump_file)
+    {
+      fprintf (dump_file,
+              "partition_by_timestamps: Total partitions: %d\n", npartitions);
+      fprintf (dump_file,
+              "partition_by_timestamps: Number of partitioned symbols: %d\n",
+              n_partitioned_symbols);
+    }
+
+  return partition;
+}
+
 /* Group cgraph nodes into equally-sized partitions.
 
    The partitioning algorithm is simple: nodes are taken in predefined order.
@@ -1072,7 +1125,7 @@ lto_balanced_map (int n_lto_partitions, int 
max_partition_size)
   int64_t cost = 0, internal = 0;
   unsigned int best_n_nodes = 0, best_i = 0;
   int64_t best_cost = -1, best_internal = 0, best_size = 0;
-  int npartitions;
+  int npartitions = 0;
   int current_order = -1;
   int noreorder_pos = 0;
 
@@ -1092,6 +1145,19 @@ lto_balanced_map (int n_lto_partitions, int 
max_partition_size)
 
   original_total_size = total_size;
 
+  /* With -fauto-profile -fprofile-reorder-functions, place all symbols which
+     are profiled together into as few partitions as possible. The rationale
+     for doing this with AutoFDO is that the number of sampled functions is a
+     fraction of total number of executed functions (unlike PGO where each
+     executed function gets instrumented with time profile counter), and
+     placing them together helps with code locality.  */
+
+  partition = NULL;
+  if (flag_auto_profile
+      && flag_profile_reorder_functions
+      && noreorder.length () == 0)
+    partition = partition_by_timestamps (max_partition_size, npartitions);
+
   /* Streaming works best when the source units do not cross partition
      boundaries much.  This is because importing function from a source
      unit tends to import a lot of global trees defined there.  We should
@@ -1126,8 +1192,12 @@ lto_balanced_map (int n_lto_partitions, int 
max_partition_size)
   partition_size = total_size / n_lto_partitions;
   if (partition_size < param_min_partition_size)
     partition_size = param_min_partition_size;
-  npartitions = 1;
-  partition = new_partition ("");
+
+  if (npartitions == 0)
+    {
+      npartitions = 1;
+      partition = new_partition ("");
+    }
   if (dump_file)
     fprintf (dump_file, "Total unit size: %" PRId64 ", partition size: %" 
PRId64 "\n",
             total_size, partition_size);
@@ -1474,15 +1544,6 @@ add_node_references_to_partition (ltrans_partition 
partition, symtab_node *node)
 
 }
 
-/* Create and return the created partition of name NAME.  */
-
-static ltrans_partition
-create_partition (int &npartitions, const char *name)
-{
-  npartitions++;
-  return new_partition (name);
-}
-
 /* Partitioning for code locality.
    The partitioning plan (and prerequisite cloning) will have been done by the
    IPA locality cloning pass.  This function just implements that plan by
diff --git a/gcc/opts.cc b/gcc/opts.cc
index 10ce2c3de33..a5a9700dc08 100644
--- a/gcc/opts.cc
+++ b/gcc/opts.cc
@@ -3179,7 +3179,13 @@ common_handle_option (struct gcc_options *opts,
       /* No break here - do -fauto-profile processing. */
       /* FALLTHRU */
     case OPT_fauto_profile:
-      enable_fdo_optimizations (opts, opts_set, value, true);
+      if (opts->x_flag_auto_profile_reorder_only)
+       {
+         opts->x_flag_profile_reorder_functions = true;
+         opts->x_flag_auto_profile_inlining = false;
+       }
+      else
+       enable_fdo_optimizations (opts, opts_set, value, true);
       SET_OPTION_IF_UNSET (opts, opts_set, flag_profile_correction, value);
       break;
 
diff --git a/gcc/passes.def b/gcc/passes.def
index 3f828477b68..bd0e80dfa5f 100644
--- a/gcc/passes.def
+++ b/gcc/passes.def
@@ -171,6 +171,7 @@ along with GCC; see the file COPYING3.  If not see
   NEXT_PASS (pass_ipa_sra);
   NEXT_PASS (pass_ipa_fn_summary);
   NEXT_PASS (pass_ipa_inline);
+  NEXT_PASS (pass_ipa_adjust_tp_first_run);
   NEXT_PASS (pass_ipa_locality_cloning);
   NEXT_PASS (pass_ipa_pure_const);
   NEXT_PASS (pass_ipa_modref);
diff --git a/gcc/tree-pass.h b/gcc/tree-pass.h
index 61cec52c624..8544a6bc552 100644
--- a/gcc/tree-pass.h
+++ b/gcc/tree-pass.h
@@ -553,6 +553,7 @@ extern ipa_opt_pass_d *make_pass_ipa_cdtor_merge 
(gcc::context *ctxt);
 extern ipa_opt_pass_d *make_pass_ipa_single_use (gcc::context *ctxt);
 extern ipa_opt_pass_d *make_pass_ipa_comdats (gcc::context *ctxt);
 extern ipa_opt_pass_d *make_pass_ipa_modref (gcc::context *ctxt);
+extern ipa_opt_pass_d *make_pass_ipa_adjust_tp_first_run (gcc::context *ctxt);
 extern ipa_opt_pass_d *make_pass_ipa_locality_cloning (gcc::context *ctxt);
 
 extern gimple_opt_pass *make_pass_cleanup_cfg_post_optimizing (gcc::context
diff --git a/gcc/varasm.cc b/gcc/varasm.cc
index 0d78f5b384f..34673a970ea 100644
--- a/gcc/varasm.cc
+++ b/gcc/varasm.cc
@@ -610,6 +610,15 @@ default_function_section (tree decl, enum node_frequency 
freq,
   if (!flag_reorder_functions
       || !targetm_common.have_named_sections)
     return NULL;
+
+  /* With -fauto-profile -fprofile-reorder-functions, give higher priority
+     to time profile based reordering, and ensure the reordering isn't split
+     by hot/cold partitioning.  */
+  if (flag_auto_profile
+      && flag_profile_reorder_functions
+      && cgraph_node::get (decl)->tp_first_run > 0)
+    return NULL;
+
   /* Startup code should go to startup subsection unless it is
      unlikely executed (this happens especially with function splitting
      where we can split away unnecessary parts of static constructors.  */

Reply via email to