Hi Honza,
Thanks for the review, please find responses inline below.

> -----Original Message-----
> From: Jan Hubicka <[email protected]>
> Sent: 01 December 2025 04:36
> To: Prathamesh Kulkarni <[email protected]>
> Cc: [email protected]
> Subject: Re: [RFC] Enable time profile function reordering with
> AutoFDO
> 
> External email: Use caution opening links or attachments
> 
> 
> > (1) Changes to auto-profile pass:
> > The patch adds a new field timestamp to function_instance, and
> > populates it in read_function_instance.
> >
> > It maintains a new timestamp_info_map from timestamp -> <name,
> > tp_first_run>, which maps timestamps sorted in ascending order to
> > (1..N), so lowest ordered timestamp is mapped to 1 and so on. The
> > rationale for this is that timestamps are 64-bit integers, and we
> > don't need the full 64-bit range for ordering by tp_first_run.
> >
> > During annotation, the timestamp associated with function_instance
> is
> > looked up in timestamp_info_map, and corresponding mapped value is
> > assigned to node->tp_first_run.
> 
> I assume that the autofdo assigns timestamps to offile function
> instances, so this can also fit into function_instance datastructure,
> but I also suppose you want to keep it separate so it is optional?
function_instance does store timestamp, and uses it as a key into 
timestamp_info_map.
My intent of keeping a separate data-structure timestamp_info_map was to sort 
64-bit timestamps into ascending order,
and then map the sorted timestamp values to (1..N), and assign the mapped value 
to node->tp_first_run.
So we don't need to up size of node->tp_first_run to 64bit (and I guess we 
won't need the full 64-bit range here?)
> >
> > (2) Handling clones:
> > Currently, for clones not registered in call graph before auto-
> profile
> > pass, the tp_first_run field is copied from original function, when
> the clone is created.
> > However that may not correspond to the actual order of functions.
> >
> > For eg, if we have two profiled clones of foo:
> > foo.constprop.1, foo.constprop.2
> >
> > both will get same value for tp_first_run as foo->tp_first_run,
> which
> > might not correspond to time profile order.
> >
> > To address this, the patch introduces a new IPA pass
> > ipa_adjust_tp_first_run, that streams <clone name, tp_first_run>
> from
> > timestamp_info_map during LGEN, and during WPA reads it, and sets
> clone's tp_first_run field accordingly.
> > The pass is placed pretty late (just before locality_cloning), by
> that
> > point clones would be registered in the call graph.
> 
> This is also a problem we have with -fprofile-use.   Since for clones
> we
> know who is calling htem, perhaps we can just assign the time based on
> minimal first run of calling functions?
> (this may be too early in case the caller has a path avoiding callee,
> but may be reasonable heurstics)
> 
> Can you separate the patch so this goes separately? Copying the
> tp_first_run value is not terrible and we can discuss it more after
> basic ifrastructure goes in.
Sure. The attached patch only keeps the base infra, I will send the part adding
ipa_adjust_tp_first_run in a separate patch.
> >
> > Dhruv's sourcefile tracking patch already handles LTO privatized
> functions.
> > The patch adds a (temporary) workaround for functions with
> > mismatched/empty filenames from gcov, to avoid getting dropped in
> > afdo_annotate_cfg by iterating thru all filenames in
> afdo_string_table if get_function_instance_by_decl fails to find
> function_instance with lbasename (DECL_SOURCE_FILE (decl)).
> >
> > (3) Grouping profiled functions together in as few partitions as
> possible (preferably single).
> > The patch places profiled functions in time profile order together
> in
> > as few paritions as possible to get better advantage of code
> locality.
> > Unlike PGO, where every instrumented function gets a time profile
> > counter, with AutoFDO, the sampled functions are a fraction of the
> total executed ones.
> > Similarly, in default_function_section, it overrides hot/cold
> > partitioning so that grouping of profiled functions isn't disrupted.
> >
> > (4) Option to disable profile driven opts.
> > The patch adds option -fauto-profile-reorder-only which only enables
> > time-profile reordering with AutoFDO (and disables profile driven
> opts):
> > (a) Useful as a debugging aid to isolate regression to either
> function reordering or profile driven opts.
> > (b) For our use case, it's also seemingly useful as a stopgap
> measure
> > to avoid regressions with AutoFDO profile driven opts, due to issues
> with profile quality obtained with merging of SPE and non SPE
> profiles.
> > We're actively working on resolving this.
> > (c) Possibly useful for architectures which do not support branch
> sampling.
> > The option is disabled by default.
> >
> > Ideally, I would like to make it a param (and not user facing
> option),
> > but I am not able to control enabling/disabling options in
> opts.cc:common_handle_option based on param value, will investigate
> this further.
> 
> Yes, I think this should be a param.  I already added option to not
> read function profile (only scale existing guessed profile), so
> perhaps we can have --param auto-profile-* to control individual auto-
> profile features that can be used for debugging until things gets more
> reliable.
Done in the attached patch.
> 
> Also does the fature work with normal partitioning? It should make
> those function with tp_first_run nonzero go into early partitions, but
> then we will likely get relatively lot of calls to functions with zero
> counts just because we lost track of them?
The results I reported have been with balanced partitioning. I am seeing 
roughly the same numbers on top of locality partitioning too.
> 
> I also wonder if offlining can behave meaningfully here.  I.e. when
> function profile is non-zero just copy first run from the caller.
IIUC, the purpose of afdo_offline pass is to remove cross module inlining from 
function_instances so AutoFDO annotation
gets a better mapping from source locations to CFG ?

Currently, the patch only assigns timestamps to functions whose outline copy is 
executed during train run (toplevel).
Would afdo offlining help for instance if a function is inlined into all it's 
callers during train run (thus the outline copy has no timestamp info),
but during AutoFDO, it may not get inlined into all it's callers ? Currently in 
this case, I guess we lose timestamp info (and thus time profile reordering for 
such a function).

Currently the transform is gated explicitly on -fauto-profile 
-fprofile-reorder-functions (so users can opt-in instead of being enabled by 
default with -fauto-profile).
Should that be OK for gcc-16 ?
Does the patch look OK if bootstrap+test passes (with and without AutoFDO) ?

Signed-off-by: Prathamesh Kulkarni <[email protected]>

Thanks,
Prathamesh
> > diff --git a/gcc/auto-profile.cc b/gcc/auto-profile.cc index
> > ae604762165..58e583d5b87 100644
> > --- a/gcc/auto-profile.cc
> > +++ b/gcc/auto-profile.cc
> > @@ -54,6 +54,8 @@ along with GCC; see the file COPYING3.  If not see
> > #include "tree-pretty-print.h"
> >  #include "gimple-pretty-print.h"
> >  #include "output.h"
> > +#include "data-streamer.h"
> > +#include "lto-streamer.h"
> >
> >  /* The following routines implements AutoFDO optimization.
> >
> > @@ -263,6 +265,8 @@ public:
> >
> >    /* Return cgraph node corresponding to given name index.  */
> >    cgraph_node *get_cgraph_node (int);
> > +
> > +  const string_vector& filenames () { return filenames_; }
> >  private:
> >    typedef std::map<const char *, unsigned, string_compare>
> string_index_map;
> >    typedef std::map<const char *, auto_vec<unsigned>,
> string_compare>
> > @@ -294,7 +298,7 @@ public:
> >       too. STACK is used to track the recursive creation process.
> */
> >    static function_instance *
> >    read_function_instance (function_instance_stack *stack,
> > -                          gcov_type head_count);
> > +                       gcov_type head_count, bool toplevel = true);
> >
> >    /* Recursively deallocate all callsites (nested
> function_instances).  */
> >    ~function_instance ();
> > @@ -329,6 +333,12 @@ public:
> >      return head_count_;
> >    }
> >
> > +  gcov_type
> > +  timestamp () const
> > +  {
> > +    return timestamp_;
> > +  }
> > +
> >    /* Traverse callsites of the current function_instance to find
> one at the
> >       location of LINENO and callee name represented in DECL.
> >       LOCATION should match LINENO and is used to output
> diagnostics.
> > */ @@ -489,11 +499,13 @@ private:
> >    /* Map from callsite to callee function_instance.  */
> >    typedef std::map<callsite, function_instance *> callsite_map;
> >
> > -  function_instance (unsigned name, unsigned file_name, gcov_type
> > head_count)
> > +  function_instance (unsigned name, unsigned file_name, gcov_type
> head_count,
> > +                  gcov_type timestamp)
> >      : name_ (name), file_name_ (file_name), total_count_ (0),
> > -      head_count_ (head_count), removed_icall_target_ (false),
> > -      realized_ (false), in_worklist_ (false), inlined_to_ (NULL),
> > -      location_ (UNKNOWN_LOCATION), call_location_
> (UNKNOWN_LOCATION)
> > +      head_count_ (head_count), timestamp_ (timestamp),
> > +      removed_icall_target_ (false), realized_ (false),
> in_worklist_ (false),
> > +      inlined_to_ (NULL), location_ (UNKNOWN_LOCATION),
> > +      call_location_ (UNKNOWN_LOCATION)
> >    {
> >    }
> >
> > @@ -512,6 +524,9 @@ private:
> >    /* Entry BB's sample count.  */
> >    gcov_type head_count_;
> >
> > +  /* perf timestamp associated with first execution of function.
> */
> > + gcov_type timestamp_;
> > +
> >    /* Map from callsite location to callee function_instance.  */
> >    callsite_map callsites;
> >
> > @@ -559,7 +574,7 @@ public:
> >    ~autofdo_source_profile ();
> >
> >    /* For a given DECL, returns the top-level function_instance.  */
> > -  function_instance *get_function_instance_by_decl (tree decl)
> const;
> > +  function_instance *get_function_instance_by_decl (tree decl,
> const
> > + char * = NULL) const;
> >
> >    /* For a given name index (and original name), returns the top-
> level
> >     * function_instance.  */
> > @@ -632,6 +647,13 @@ static autofdo_source_profile
> > *afdo_source_profile;
> >  /* gcov_summary structure to store the profile_info.  */  static
> > gcov_summary *afdo_profile_info;
> >
> > +/* Map from timestamp -> <name, tp_first_run>.
> > +
> > +The purpose of this map is to map 64-bit timestamp values to
> (1..N),
> > +sorted by ascending order of timestamps and assign that to
> > +node->tp_first_run, since we don't need the full 64-bit range.  */
> > +static std::map<gcov_type, std::pair<int, int>> timestamp_info_map;
> > +
> >  /* Scaling factor for afdo data.  Compared to normal profile
> >     AFDO profile counts are much lower, depending on sampling
> >     frequency.  We scale data up to reudce effects of roundoff @@
> > -2477,7 +2499,7 @@
> autofdo_source_profile::offline_unrealized_inlines ()
> >         delete f;
> >       }
> >        /* If function is not in symbol table, remove it.  */
> > -      else if (!f->realized_p ())
> > +      else if (in_map && !f->realized_p ())
> >       {
> >         if (dump_file)
> >           fprintf (dump_file, "Removing optimized out function
> %s\n",
> > @@ -2502,6 +2524,7 @@
> > autofdo_source_profile::offline_unrealized_inlines ()
> >  /* function instance profile format:
> >
> >     ENTRY_COUNT: 8 bytes
> > +   TIMESTAMP: 8 bytes (only for toplevel symbols)
> >     NAME_INDEX: 4 bytes
> >     NUM_POS_COUNTS: 4 bytes
> >     NUM_CALLSITES: 4 byte
> > @@ -2528,14 +2551,17 @@
> > autofdo_source_profile::offline_unrealized_inlines ()
> >
> >  function_instance *
> >  function_instance::read_function_instance (function_instance_stack
> *stack,
> > -                                        gcov_type head_count)
> > +                                        gcov_type head_count, bool
> > + toplevel)
> >  {
> > +  gcov_type_unsigned timestamp = 0;
> > +  if (toplevel)
> > +    timestamp = (gcov_type_unsigned) gcov_read_counter ();
> >    unsigned name = gcov_read_unsigned ();
> >    unsigned num_pos_counts = gcov_read_unsigned ();
> >    unsigned num_callsites = gcov_read_unsigned ();
> >    function_instance *s
> >      = new function_instance (name, afdo_string_table-
> >get_filename_idx (name),
> > -                          head_count);
> > +                          head_count, timestamp);
> >    if (!stack->is_empty ())
> >      s->set_inlined_to (stack->last ());
> >    stack->safe_push (s);
> > @@ -2563,7 +2589,7 @@ function_instance::read_function_instance
> (function_instance_stack *stack,
> >      {
> >        unsigned offset = gcov_read_unsigned ();
> >        function_instance *callee_function_instance
> > -          = read_function_instance (stack, -1);
> > +     = read_function_instance (stack, -1, false);
> >        s->callsites[std::make_pair (offset,
> callee_function_instance->name ())]
> >            = callee_function_instance;
> >      }
> > @@ -2585,9 +2611,11 @@
> autofdo_source_profile::~autofdo_source_profile
> > ()
> >  /* For a given DECL, returns the top-level function_instance.  */
> >
> >  function_instance *
> > -autofdo_source_profile::get_function_instance_by_decl (tree decl)
> > const
> > +autofdo_source_profile::get_function_instance_by_decl (tree decl,
> > +const char *filename) const
> >  {
> > -  const char *filename = lbasename (DECL_SOURCE_FILE (decl));
> > +  if (!filename)
> > +    filename = lbasename (DECL_SOURCE_FILE (decl));
> > +
> >    int index = afdo_string_table->get_index_by_decl (decl);
> >    if (index == -1)
> >      return NULL;
> > @@ -2829,7 +2857,21 @@ autofdo_source_profile::read ()
> >       fatal_error (UNKNOWN_LOCATION,
> >                    "auto-profile contains duplicated function
> instance %s",
> >                    afdo_string_table->get_name (s->name ()));
> > +
> > +      timestamp_info_map.insert({s->timestamp (),
> > +                              std::make_pair (s->name (), 0)});
> >      }
> > +
> > +  /* timestamp_info_map is std::map with timestamp as key,
> > +     so it's already sorted in ascending order wrt timestamps.
> > +     This loop maps function with lowest timestamp to 1, and so on.
> > +     In afdo_annotate_cfg, node->tp_first_run is then set to
> corresponding
> > +     tp_first_run value.  */
> > +
> > +  int tp_first_run = 1;
> > +  for (auto &p : timestamp_info_map)
> > +    p.second.second = tp_first_run++;
> > +
> >    int hot_frac = param_hot_bb_count_fraction;
> >    /* Scale up the profile, but leave some bits in case some counts
> gets
> >       bigger than sum_max eventually.  */ @@ -4079,6 +4121,17 @@
> > afdo_annotate_cfg (void)
> >        = afdo_source_profile->get_function_instance_by_decl (
> >            current_function_decl);
> >
> > +  /* FIXME: This is a workaround for sourcefile tracking, if
> afdo_string_table
> > +     ends up with empty filename or incorrect filename for the
> function and
> > +     should be removed once issues with sourcefile tracking get
> > + fixed.  */  if (s == NULL)
> > +    for (unsigned i = 0; i < afdo_string_table->filenames ().length
> (); i++)
> > +      {
> > +     s = afdo_source_profile->get_function_instance_by_decl
> (current_function_decl, afdo_string_table->filenames()[i]);
> > +     if (s)
> > +       break;
> > +      }
> > +
> >    if (s == NULL)
> >      {
> >        if (dump_file)
> > @@ -4087,7 +4140,8 @@ afdo_annotate_cfg (void)
> >        /* create_gcov only dumps symbols with some samples in them.
> >        This means that we get nonempty zero_bbs only if some
> >        nonzero counts in profile were not matched with statements.
> */
> > -      if (!flag_profile_partial_training)
> > +      if (!flag_profile_partial_training
> > +       && !flag_auto_profile_reorder_only)
> >       {
> >         FOR_ALL_BB_FN (bb, cfun)
> >           if (bb->count.quality () == GUESSED_LOCAL) @@ -4097,6
> > +4151,20 @@ afdo_annotate_cfg (void)
> >        return;
> >      }
> >
> > +  if (auto it = timestamp_info_map.find (s->timestamp ());
> > +      it != timestamp_info_map.end ())
> > +    {
> > +      cgraph_node *node = cgraph_node::get (current_function_decl);
> > +      node->tp_first_run = it->second.second;
> > +
> > +      if (dump_file)
> > +     fprintf (dump_file, "Setting %s->tp_first_run to %d\n",
> > +              node->asm_name (), node->tp_first_run);
> > +    }
> > +
> > +  if (flag_auto_profile_reorder_only)
> > +    return;
> > +
> >    calculate_dominance_info (CDI_POST_DOMINATORS);
> >    calculate_dominance_info (CDI_DOMINATORS);
> >    loop_optimizer_init (0);
> > @@ -4496,3 +4564,182 @@ make_pass_ipa_auto_profile_offline
> > (gcc::context *ctxt)  {
> >    return new pass_ipa_auto_profile_offline (ctxt);  }
> > +
> > +/* Map from name -> tp_first_run for clones. We need this for
> initiailizing tp_first_run
> > +   of clones created after auto-profile pass.  */
> > +
> > +static std::map<const char *, int, autofdo::string_compare>
> > +clone_timestamp_map;
> 
> This probably should be symbol_summary, so it stays up to date with
> symbol table changes?
> > +
> > +/* Check if name is a clone function name. LTO privatized clones
> should already
> > +   be handled with sourcefile tracking.  */
> > +
> > +static bool
> > +clone_function_p (const char *name)
> > +{
> > +  return strchr (name, '.') && !strstr (name, "lto_priv"); }
> > +
> > +/* Gather <name, tp_first_run> value from
> > +autofdo::timestamp_info_map.  */
> > +
> > +static void
> > +adjust_tp_first_run_generate_summary (void) {
> > +  for (const auto &p : autofdo::timestamp_info_map)
> > +    {
> > +      gcov_type timestamp = p.first;
> > +      int name_id = p.second.first;
> > +      int tp_first_run = p.second.second;
> > +
> > +      if (const char *name = autofdo::afdo_string_table->get_name
> (name_id);
> > +       name && clone_function_p (name))
> > +     clone_timestamp_map.insert({xstrdup (name), tp_first_run});
> > +    }
> > +}
> > +
> > +/* Stream out <name, tp_first_run>.  */
> > +
> > +static void
> > +adjust_tp_first_run_write_summary (void) {
> > +  struct output_block *ob
> > +    = create_output_block (LTO_section_ipa_adjust_tp_first_run);
> > +
> > +  streamer_write_uhwi (ob, clone_timestamp_map.size ());  for
> (const
> > + auto &p : clone_timestamp_map)
> > +    {
> > +      streamer_write_string (ob, ob->main_stream, p.first, true);
> > +      streamer_write_hwi (ob, p.second);
> > +    }
> > +
> > +  produce_asm (ob);
> > +  destroy_output_block (ob);
> > +}
> > +
> > +/* Stream in <name, tp_first_run> and put it in
> clone_timestamp_map.
> > +*/
> > +
> > +static void
> > +adjust_tp_first_run_read_summary (void) {
> > +  struct lto_file_decl_data **file_data_vec =
> lto_get_file_decl_data
> > +();
> > +  struct lto_file_decl_data *file_data;
> > +  unsigned j = 0;
> > +
> > +  while ((file_data = file_data_vec[j++]))
> > +    {
> > +      size_t len;
> > +      const char *data
> > +     = lto_get_summary_section_data (file_data,
> > +
> LTO_section_ipa_adjust_tp_first_run,
> > +                                     &len);
> > +      if (!data)
> > +     continue;
> > +
> > +      const struct lto_function_header *header
> > +     = (const struct lto_function_header *) data;
> > +      const int cfg_offset = sizeof (struct lto_function_header);
> > +      const int main_offset = cfg_offset + header->cfg_size;
> > +      const int string_offset = main_offset + header->main_size;
> > +      struct data_in *data_in;
> > +
> > +      lto_input_block ib ((const char *) data + main_offset,
> > +                       header->main_size, file_data);
> > +
> > +      data_in
> > +     = lto_data_in_create (file_data, (const char *) data +
> string_offset,
> > +                           header->string_size, vNULL);
> > +
> > +      unsigned num_clones = streamer_read_uhwi (&ib);
> > +      for (unsigned i = 0; i < num_clones; i++)
> > +     {
> > +       const char *fn_name = streamer_read_string (data_in, &ib);
> > +       int tp_first_run = streamer_read_hwi (&ib);
> > +       clone_timestamp_map.insert ({xstrdup (fn_name),
> tp_first_run});
> > +     }
> > +
> > +      lto_free_section_data (file_data,
> LTO_section_ipa_adjust_tp_first_run,
> > +                          NULL, data, len);
> > +      lto_data_in_delete (data_in);
> > +    }
> > +}
> > +
> > +/* A late IPA pass for handling clones.
> > +   During auto-profile pass, for clones that aren't registered in
> call graph,
> > +   stream <name, tp_first_run> from timestamp_info_map during LGEN.
> > +   And during WPA, stream back the info, and assign the field
> value.
> > +   The pass is placed sufficiently late (just before
> locality_cloning), that
> > +   it should have clones registered in the call graph.  */
> > +
> > +static unsigned
> > +adjust_tp_first_run ()
> > +{
> > +  for (const auto &p : clone_timestamp_map)
> > +    if (cgraph_node *node
> > +     = cgraph_node::get_for_asmname (get_identifier (p.first)))
> > +      {
> > +     node->tp_first_run = p.second;
> > +     if (dump_file)
> > +       fprintf (dump_file, "Setting %s->tp_first_run to %d\n",
> > +                node->asm_name (), node->tp_first_run);
> > +      }
> > +
> > +  for (const auto &p : clone_timestamp_map)
> > +    free ((void *) p.first);
> > +  clone_timestamp_map.clear ();
> > +
> > +  return 0;
> > +}
> > +
> > +namespace {
> > +
> > +const pass_data pass_data_ipa_adjust_tp_first_run = {
> > +  IPA_PASS, /* type */
> > +  "adjust-tp_first_run", /* name */
> > +  OPTGROUP_NONE, /* optinfo_flags */
> > +  TV_CGRAPHOPT, /* tv_id */
> > +  0, /* properties_required */
> > +  0, /* properties_provided */
> > +  0, /* properties_destroyed */
> > +  0, /* todo_flags_start */
> > +  0, /* todo_flags_finish */
> > +};
> > +
> > +class pass_ipa_adjust_tp_first_run : public ipa_opt_pass_d {
> > +public:
> > +  pass_ipa_adjust_tp_first_run (gcc::context *ctxt)
> > +    : ipa_opt_pass_d (pass_data_ipa_adjust_tp_first_run, ctxt,
> > +                   adjust_tp_first_run_generate_summary, /*
> generate_summary */
> > +                   adjust_tp_first_run_write_summary, /*
> write_summary */
> > +                   adjust_tp_first_run_read_summary, /*
> read_summary */
> > +                   NULL, /* write_optimization_summary */
> > +                   NULL, /* read_optimization_summary */
> > +                   NULL, /* stmt_fixup */
> > +                      0, /* function_transform_todo_flags_start */
> > +                   NULL, /* function_transform */
> > +                   NULL) /* variable_transform */
> > +  {}
> > +
> > +  /* opt_pass methods: */
> > +  bool
> > +  gate (function *) final override
> > +  {
> > +    return flag_auto_profile;
> > +  }
> > +
> > +  unsigned
> > +  execute (function *) final override  {
> > +    return adjust_tp_first_run ();
> > +  }
> > +
> > +}; // class pass_ipa_adjust_tp_first_run
> > +
> > +} // anon namespace
> > +
> > +ipa_opt_pass_d *
> > +make_pass_ipa_adjust_tp_first_run (gcc::context *ctxt) {
> > +  return new pass_ipa_adjust_tp_first_run (ctxt); }
> > +
> > diff --git a/gcc/common.opt b/gcc/common.opt index
> > f6d93dc05fb..d7c028f87fe 100644
> > --- a/gcc/common.opt
> > +++ b/gcc/common.opt
> > @@ -1203,6 +1203,10 @@ fauto-profile-inlining  Common
> > Var(flag_auto_profile_inlining) Init(1) Optimization  Perform
> inlining
> > using auto-profile.
> >
> > +fauto-profile-reorder-only
> > +Common Var(flag_auto_profile_reorder_only) Init(0) Optimization
> > +Perform only function based reordering with fauto-profile.
> > +
> >  ; -fcheck-bounds causes gcc to generate array bounds checks.
> >  ; For C, C++ and ObjC: defaults off.
> >  ; For Java: defaults to on.
> > diff --git a/gcc/ipa-locality-cloning.cc b/gcc/ipa-locality-
> cloning.cc
> > index 2684046bd2d..1653ea7b961 100644
> > --- a/gcc/ipa-locality-cloning.cc
> > +++ b/gcc/ipa-locality-cloning.cc
> > @@ -994,6 +994,50 @@ locality_determine_static_order
> (auto_vec<locality_order *> *order)
> >    return true;
> >  }
> >
> > +/* Similar to lto-partition.cc:partition_by_timestamp.s  */
> > +
> > +static locality_partition
> > +partition_by_timestamps (int64_t max_partition_size, int&
> > +npartitions) {
> > +  vec<cgraph_node *> cnodes = vNULL;
> > +  cgraph_node *node;
> > +  FOR_EACH_DEFINED_FUNCTION (node)
> > +    if (node->tp_first_run > 0
> > +     && node->get_partitioning_class () == SYMBOL_PARTITION
> > +     && ipa_size_summaries->get (node))
> > +      cnodes.safe_push (node);
> > +  cnodes.qsort (tp_first_run_node_cmp);
> > +
> > +  unsigned n_partitioned_symbols = 0;
> > +
> > +  locality_partition partition = NULL;  unsigned i;
> FOR_EACH_VEC_ELT
> > + (cnodes, i, node)
> > +    {
> > +      if (node_partitioned_p (node))
> > +     continue;
> > +      ipa_size_summary *size_summary = ipa_size_summaries->get
> > + (node);
> > +
> > +      if (partition == NULL
> > +       || partition->insns + size_summary->size >
> max_partition_size)
> > +     partition = create_partition (npartitions);
> > +
> > +      add_node_to_partition (partition, node);
> > +      n_partitioned_symbols++;
> > +    }
> > +
> > +  if (dump_file)
> > +    {
> > +      fprintf (dump_file,
> > +            "partition_by_timestamps: Total partitions: %d\n",
> npartitions);
> > +      fprintf (dump_file,
> > +            "partition_by_timestamps: Number of partitioned
> symbols: %d\n",
> > +            n_partitioned_symbols);
> > +    }
> > +
> > +  return partition;
> > +}
> > +
> >  /* Partitioning for code locality.
> >     1. Create and sort callchains.  If PGO is available, use real
> profile
> >     counts.  Otherwise, use a set of heuristics to sort the
> callchains.
> > @@ -1007,7 +1051,7 @@ locality_partition_and_clone (int
> max_locality_partition_size,
> >                             lto_locality_cloning_model
> cloning_model,
> >                             int freq_denominator, int size)  {
> > -  locality_partition partition;
> > +  locality_partition partition = NULL;
> >    int npartitions = 0;
> >
> >    auto_vec<locality_order *> order;
> > @@ -1040,7 +1084,11 @@ locality_partition_and_clone (int
> max_locality_partition_size,
> >    int64_t partition_size
> >      = max_locality_partition_size
> >       ? max_locality_partition_size : param_max_partition_size;
> > -  partition = create_partition (npartitions);
> > +
> > +  if (flag_auto_profile && flag_profile_reorder_functions)
> > +    partition = partition_by_timestamps (partition_size,
> > + npartitions);  if (!partition)
> > +    partition = create_partition (npartitions);
> >
> >    for (unsigned i = 0; i < order.length (); i++)
> >      {
> > diff --git a/gcc/lto-section-in.cc b/gcc/lto-section-in.cc index
> > 1dd9520137a..2e2e30ea400 100644
> > --- a/gcc/lto-section-in.cc
> > +++ b/gcc/lto-section-in.cc
> > @@ -57,6 +57,7 @@ const char *lto_section_name[LTO_N_SECTION_TYPES]
> =
> >    "ipa_sra",
> >    "odr_types",
> >    "ipa_modref",
> > +  "ipa_adjust_tp_first_run",
> >  };
> >
> >  /* Hooks so that the ipa passes can call into the lto front end to
> > get diff --git a/gcc/lto-streamer.h b/gcc/lto-streamer.h index
> > 4b7209e3d82..242579918b0 100644
> > --- a/gcc/lto-streamer.h
> > +++ b/gcc/lto-streamer.h
> > @@ -225,6 +225,7 @@ enum lto_section_type
> >    LTO_section_ipa_sra,
> >    LTO_section_odr_types,
> >    LTO_section_ipa_modref,
> > +  LTO_section_ipa_adjust_tp_first_run,
> >    LTO_N_SECTION_TYPES                /* Must be last.  */
> >  };
> >
> > diff --git a/gcc/lto/lto-partition.cc b/gcc/lto/lto-partition.cc
> index
> > c7e69ee3161..c7937107ad7 100644
> > --- a/gcc/lto/lto-partition.cc
> > +++ b/gcc/lto/lto-partition.cc
> > @@ -87,6 +87,15 @@ new_partition (const char *name)
> >    return part;
> >  }
> >
> > +/* Create and return the created partition of name NAME.  */
> > +
> > +static ltrans_partition
> > +create_partition (int &npartitions, const char *name) {
> > +  npartitions++;
> > +  return new_partition (name);
> > +}
> > +
> >  /* Free memory used by ltrans partition.
> >     Encoder can be kept to be freed after streaming.  */  static
> void
> > @@ -1017,6 +1026,50 @@ lto_cache_map (int n_lto_partitions, int
> max_partition_size)
> >    partitioner.apply (partitions, n_lto_partitions);  }
> >
> > +/* Place symbols that are profiled in as few partitions as
> possible.
> > +*/
> > +
> > +static ltrans_partition
> > +partition_by_timestamps (int max_partition_size, int &npartitions)
> {
> > +  vec<cgraph_node *> cnodes = vNULL;
> > +  cgraph_node *node;
> > +  FOR_EACH_DEFINED_FUNCTION (node)
> > +    if (node->tp_first_run > 0
> > +     && node->get_partitioning_class () == SYMBOL_PARTITION
> > +     && ipa_size_summaries->get (node))
> > +      cnodes.safe_push (node);
> > +  cnodes.qsort (tp_first_run_node_cmp);
> > +
> > +  unsigned n_partitioned_symbols = 0;
> > +
> > +  ltrans_partition partition = NULL;
> > +  unsigned i;
> > +  FOR_EACH_VEC_ELT (cnodes, i, node)
> > +    {
> > +      if (symbol_partitioned_p (node))
> > +     continue;
> > +
> > +      ipa_size_summary *size_summary = ipa_size_summaries->get
> (node);
> > +      if (partition == NULL
> > +       || partition->insns + size_summary->size >
> max_partition_size)
> > +     partition = create_partition (npartitions, "");
> > +
> > +      add_symbol_to_partition (partition, node);
> > +      n_partitioned_symbols++;
> > +    }
> > +
> > +  if (dump_file)
> > +    {
> > +      fprintf (dump_file,
> > +            "partition_by_timestamps: Total partitions: %d\n",
> npartitions);
> > +      fprintf (dump_file,
> > +            "partition_by_timestamps: Number of partitioned
> symbols: %d\n",
> > +            n_partitioned_symbols);
> > +    }
> > +
> > +  return partition;
> > +}
> > +
> >  /* Group cgraph nodes into equally-sized partitions.
> >
> >     The partitioning algorithm is simple: nodes are taken in
> predefined order.
> > @@ -1072,7 +1125,7 @@ lto_balanced_map (int n_lto_partitions, int
> max_partition_size)
> >    int64_t cost = 0, internal = 0;
> >    unsigned int best_n_nodes = 0, best_i = 0;
> >    int64_t best_cost = -1, best_internal = 0, best_size = 0;
> > -  int npartitions;
> > +  int npartitions = 0;
> >    int current_order = -1;
> >    int noreorder_pos = 0;
> >
> > @@ -1092,6 +1145,19 @@ lto_balanced_map (int n_lto_partitions, int
> > max_partition_size)
> >
> >    original_total_size = total_size;
> >
> > +  /* With -fauto-profile -fprofile-reorder-functions, place all
> symbols which
> > +     are profiled together into as few partitions as possible. The
> rationale
> > +     for doing this with AutoFDO is that the number of sampled
> functions is a
> > +     fraction of total number of executed functions (unlike PGO
> where each
> > +     executed function gets instrumented with time profile
> counter), and
> > +     placing them together helps with code locality.  */
> > +
> > +  partition = NULL;
> > +  if (flag_auto_profile
> > +      && flag_profile_reorder_functions
> > +      && noreorder.length () == 0)
> > +    partition = partition_by_timestamps (max_partition_size,
> > + npartitions);
> > +
> >    /* Streaming works best when the source units do not cross
> partition
> >       boundaries much.  This is because importing function from a
> source
> >       unit tends to import a lot of global trees defined there.  We
> > should @@ -1126,8 +1192,12 @@ lto_balanced_map (int
> n_lto_partitions, int max_partition_size)
> >    partition_size = total_size / n_lto_partitions;
> >    if (partition_size < param_min_partition_size)
> >      partition_size = param_min_partition_size;
> > -  npartitions = 1;
> > -  partition = new_partition ("");
> > +
> > +  if (npartitions == 0)
> > +    {
> > +      npartitions = 1;
> > +      partition = new_partition ("");
> > +    }
> >    if (dump_file)
> >      fprintf (dump_file, "Total unit size: %" PRId64 ", partition
> size: %" PRId64 "\n",
> >            total_size, partition_size); @@ -1474,15 +1544,6 @@
> > add_node_references_to_partition (ltrans_partition partition,
> > symtab_node *node)
> >
> >  }
> >
> > -/* Create and return the created partition of name NAME.  */
> > -
> > -static ltrans_partition
> > -create_partition (int &npartitions, const char *name) -{
> > -  npartitions++;
> > -  return new_partition (name);
> > -}
> > -
> >  /* Partitioning for code locality.
> >     The partitioning plan (and prerequisite cloning) will have been
> done by the
> >     IPA locality cloning pass.  This function just implements that
> > plan by diff --git a/gcc/opts.cc b/gcc/opts.cc index
> > 10ce2c3de33..a5a9700dc08 100644
> > --- a/gcc/opts.cc
> > +++ b/gcc/opts.cc
> > @@ -3179,7 +3179,13 @@ common_handle_option (struct gcc_options
> *opts,
> >        /* No break here - do -fauto-profile processing. */
> >        /* FALLTHRU */
> >      case OPT_fauto_profile:
> > -      enable_fdo_optimizations (opts, opts_set, value, true);
> > +      if (opts->x_flag_auto_profile_reorder_only)
> > +     {
> > +       opts->x_flag_profile_reorder_functions = true;
> > +       opts->x_flag_auto_profile_inlining = false;
> > +     }
> > +      else
> > +     enable_fdo_optimizations (opts, opts_set, value, true);
> >        SET_OPTION_IF_UNSET (opts, opts_set, flag_profile_correction,
> value);
> >        break;
> >
> > diff --git a/gcc/passes.def b/gcc/passes.def index
> > 3f828477b68..bd0e80dfa5f 100644
> > --- a/gcc/passes.def
> > +++ b/gcc/passes.def
> > @@ -171,6 +171,7 @@ along with GCC; see the file COPYING3.  If not
> see
> >    NEXT_PASS (pass_ipa_sra);
> >    NEXT_PASS (pass_ipa_fn_summary);
> >    NEXT_PASS (pass_ipa_inline);
> > +  NEXT_PASS (pass_ipa_adjust_tp_first_run);
> >    NEXT_PASS (pass_ipa_locality_cloning);
> >    NEXT_PASS (pass_ipa_pure_const);
> >    NEXT_PASS (pass_ipa_modref);
> > diff --git a/gcc/tree-pass.h b/gcc/tree-pass.h index
> > 61cec52c624..8544a6bc552 100644
> > --- a/gcc/tree-pass.h
> > +++ b/gcc/tree-pass.h
> > @@ -553,6 +553,7 @@ extern ipa_opt_pass_d *make_pass_ipa_cdtor_merge
> > (gcc::context *ctxt);  extern ipa_opt_pass_d
> *make_pass_ipa_single_use
> > (gcc::context *ctxt);  extern ipa_opt_pass_d *make_pass_ipa_comdats
> > (gcc::context *ctxt);  extern ipa_opt_pass_d *make_pass_ipa_modref
> > (gcc::context *ctxt);
> > +extern ipa_opt_pass_d *make_pass_ipa_adjust_tp_first_run
> > +(gcc::context *ctxt);
> >  extern ipa_opt_pass_d *make_pass_ipa_locality_cloning (gcc::context
> > *ctxt);
> >
> >  extern gimple_opt_pass *make_pass_cleanup_cfg_post_optimizing
> > (gcc::context diff --git a/gcc/varasm.cc b/gcc/varasm.cc index
> > 0d78f5b384f..34673a970ea 100644
> > --- a/gcc/varasm.cc
> > +++ b/gcc/varasm.cc
> > @@ -610,6 +610,15 @@ default_function_section (tree decl, enum
> node_frequency freq,
> >    if (!flag_reorder_functions
> >        || !targetm_common.have_named_sections)
> >      return NULL;
> > +
> > +  /* With -fauto-profile -fprofile-reorder-functions, give higher
> priority
> > +     to time profile based reordering, and ensure the reordering
> isn't split
> > +     by hot/cold partitioning.  */
> > +  if (flag_auto_profile
> > +      && flag_profile_reorder_functions
> > +      && cgraph_node::get (decl)->tp_first_run > 0)
> > +    return NULL;
> > +
> >    /* Startup code should go to startup subsection unless it is
> >       unlikely executed (this happens especially with function
> splitting
> >       where we can split away unnecessary parts of static
> > constructors.  */

Enable time profile function reordering with AutoFDO.

The patch enables time profile based reordering with AutoFDO with
-fauto-profile -fprofile-reorder-functions, by mapping timestamps obtained from 
perf
into node->tp_first_run. 

The rationale for doing this is:
(1) GCC already implements time-profile function reordering with PGO, the patch 
enables
it with AutoFDO.
(2) While time profile ordering is primarly meant for optimizing startup time,
we've also observed good effects on code-locality for large internal workloads.
(3) Possibly useful for function reordering when accurate profile annotation is
hard with AutoFDO -- For eg, if branch samples are missing (due to absence of
LBR like structure).

On AutoFDO tools side, a corresponding patch extends gcov to emit 64-bit perf 
timestamp that
records first execution of function, which loosely corresponds to PGO's 
time_profile counter.
The timestamp is stored adjacent to head field in toplevel function info.

On GCC side, this patch makes the following changes:

(1) Changes to auto-profile pass:
The patch adds a new field timestamp to function_instance,
and populates it in read_function_instance. 
It maintains a new timestamp_info_map from timestamp -> <name, tp_first_run>,
which maps timestamps sorted in ascending order to (1..N), so lowest ordered
timestamp is mapped to 1 and so on. The rationale for this is that
timestamps are 64-bit integers, and we don't need the full 64-bit range
for ordering by tp_first_run.

During annotation, the timestamp associated with function_instance is looked up
in timestamp_info_map, and corresponding mapped value is assigned
to node->tp_first_run.

Dhruv's sourcefile tracking patch already handles LTO privatized symbols.
The patch adds a workaround for mismatched/empty filenames, which should go away
when the issues with AutoFDO tools dwarf parsing are resolved.

(2) Grouping profiled functions together in as few partitions as possible:
The patch places profiled functions in time profile order together in as
few paritions as possible to get better advantage of code locality.
Unlike PGO, where every instrumented function gets a time profile,
with AutoFDO, the sampled functions are a fraction of the total executed ones.
Similarly, in default_function_section, it overrides hot/cold partitioning so
that grouping of profiled functions isn't disrupted.

(3) Param to disable profile driven opts.
The patch adds param auto-profile-reorder-only which only enables time-profile 
reordering with
AutoFDO:
(a) Useful as a debugging aid to isolate regression to either function 
reordering or profile driven opts.
(b) As a stopgap measure to avoid regressions with AutoFDO profile driven opts. 
(c) Possibly useful for architectures which do not support branch sampling.

gcc/ChangeLog:
        * auto-profile.cc: (string_table::filenames): New method.
        (function_instance::timestamp_): New member.
        (function_instance::timestamp): New accessor for timestamp_ member.
        (autofdo::timestamp_info_map): New std::map.
        (function_instance::function_instance): Add argument timestamp and set
        it to member timestamp_.
        (function_instance::read_function_instance): Adjust prototype and read
        timestamp.
        (autofdo_source_profile::get_function_instance_by_decl): New argument
        filename with default value NULL.
        (autofdo_source_profile::read): Populate timestamp_info_map.
        (afdo_annotate_cfg): Assign node->tp_first_run based on
        timestamp_info_map and bail out of annotation if
        param auto-profile-reorder-only is enabled.
        * params.opt: New param auto-profile-reorder-only.
        * ipa-locality-cloning.cc (partition_by_timestamps): New function.
        (locality_partition_and_clone): Call partition_by_timestamps.
        * varasm.cc (default_function_section): Bail out if -fauto-profile
        -fprofile-reorder-functions is enabled and node->tp_first_run > 0.

gcc/lto/ChangeLog:
        * lto-partition.cc (create_partition): Move higher up in the file.
        (partition_by_timestamps): New function.
        (lto_balanced_map): Call partition_by_timestamps if -fauto-profile
        -fprofile-reorder-functions is passed and noreroder is empty.

Signed-off-by: Prathamesh Kulkarni <[email protected]>

diff --git a/gcc/auto-profile.cc b/gcc/auto-profile.cc
index 1e50d8e6e71..cdb6dad06e5 100644
--- a/gcc/auto-profile.cc
+++ b/gcc/auto-profile.cc
@@ -281,6 +281,8 @@ public:
 
   /* Return cgraph node corresponding to given name index.  */
   cgraph_node *get_cgraph_node (int);
+
+  const string_vector& filenames () { return filenames_; }
 private:
   typedef std::map<const char *, unsigned, string_compare> string_index_map;
   typedef std::map<const char *, auto_vec<unsigned>, string_compare>
@@ -312,7 +314,7 @@ public:
      too. STACK is used to track the recursive creation process.  */
   static function_instance *
   read_function_instance (function_instance_stack *stack,
-                          gcov_type head_count);
+                         gcov_type head_count, bool toplevel = true);
 
   /* Recursively deallocate all callsites (nested function_instances).  */
   ~function_instance ();
@@ -347,6 +349,12 @@ public:
     return head_count_;
   }
 
+  gcov_type
+  timestamp () const
+  {
+    return timestamp_;
+  }
+
   /* Traverse callsites of the current function_instance to find one at the
      location of LINENO and callee name represented in DECL.
      LOCATION should match LINENO and is used to output diagnostics.  */
@@ -507,11 +515,13 @@ private:
   /* Map from callsite to callee function_instance.  */
   typedef std::map<callsite, function_instance *> callsite_map;
 
-  function_instance (unsigned name, unsigned file_name, gcov_type head_count)
+  function_instance (unsigned name, unsigned file_name, gcov_type head_count,
+                    gcov_type timestamp)
     : name_ (name), file_name_ (file_name), total_count_ (0),
-      head_count_ (head_count), removed_icall_target_ (false),
-      realized_ (false), in_worklist_ (false), inlined_to_ (NULL),
-      location_ (UNKNOWN_LOCATION), call_location_ (UNKNOWN_LOCATION)
+      head_count_ (head_count), timestamp_ (timestamp),
+      removed_icall_target_ (false), realized_ (false), in_worklist_ (false),
+      inlined_to_ (NULL), location_ (UNKNOWN_LOCATION),
+      call_location_ (UNKNOWN_LOCATION)
   {
   }
 
@@ -530,6 +540,9 @@ private:
   /* Entry BB's sample count.  */
   gcov_type head_count_;
 
+  /* perf timestamp associated with first execution of function.  */
+  gcov_type timestamp_;
+
   /* Map from callsite location to callee function_instance.  */
   callsite_map callsites;
 
@@ -577,7 +590,7 @@ public:
   ~autofdo_source_profile ();
 
   /* For a given DECL, returns the top-level function_instance.  */
-  function_instance *get_function_instance_by_decl (tree decl) const;
+  function_instance *get_function_instance_by_decl (tree decl, const char * = 
NULL) const;
 
   /* For a given name index (and original name), returns the top-level
    * function_instance.  */
@@ -650,6 +663,13 @@ static autofdo_source_profile *afdo_source_profile;
 /* gcov_summary structure to store the profile_info.  */
 static gcov_summary *afdo_profile_info;
 
+/* Map from timestamp -> <name, tp_first_run>.
+
+The purpose of this map is to map 64-bit timestamp values to (1..N),
+sorted by ascending order of timestamps and assign that to node->tp_first_run,
+since we don't need the full 64-bit range.  */
+static std::map<gcov_type, std::pair<int, int>> timestamp_info_map;
+
 /* Scaling factor for afdo data.  Compared to normal profile
    AFDO profile counts are much lower, depending on sampling
    frequency.  We scale data up to reudce effects of roundoff
@@ -2523,6 +2543,7 @@ autofdo_source_profile::offline_unrealized_inlines ()
 /* function instance profile format:
 
    ENTRY_COUNT: 8 bytes
+   TIMESTAMP: 8 bytes (only for toplevel symbols)
    NAME_INDEX: 4 bytes
    NUM_POS_COUNTS: 4 bytes
    NUM_CALLSITES: 4 byte
@@ -2549,14 +2570,17 @@ autofdo_source_profile::offline_unrealized_inlines ()
 
 function_instance *
 function_instance::read_function_instance (function_instance_stack *stack,
-                                          gcov_type head_count)
+                                          gcov_type head_count, bool toplevel)
 {
+  gcov_type_unsigned timestamp = 0;
+  if (toplevel)
+    timestamp = (gcov_type_unsigned) gcov_read_counter ();
   unsigned name = gcov_read_unsigned ();
   unsigned num_pos_counts = gcov_read_unsigned ();
   unsigned num_callsites = gcov_read_unsigned ();
   function_instance *s
     = new function_instance (name, afdo_string_table->get_filename_idx (name),
-                            head_count);
+                            head_count, timestamp);
   if (!stack->is_empty ())
     s->set_inlined_to (stack->last ());
   stack->safe_push (s);
@@ -2584,7 +2608,7 @@ function_instance::read_function_instance 
(function_instance_stack *stack,
     {
       unsigned offset = gcov_read_unsigned ();
       function_instance *callee_function_instance
-          = read_function_instance (stack, -1);
+       = read_function_instance (stack, -1, false);
       s->callsites[std::make_pair (offset, callee_function_instance->name ())]
           = callee_function_instance;
     }
@@ -2606,9 +2630,11 @@ autofdo_source_profile::~autofdo_source_profile ()
 /* For a given DECL, returns the top-level function_instance.  */
 
 function_instance *
-autofdo_source_profile::get_function_instance_by_decl (tree decl) const
+autofdo_source_profile::get_function_instance_by_decl (tree decl, const char 
*filename) const
 {
-  const char *filename = lbasename (DECL_SOURCE_FILE (decl));
+  if (!filename)
+    filename = lbasename (DECL_SOURCE_FILE (decl));
+
   int index = afdo_string_table->get_index_by_decl (decl);
   if (index == -1)
     return NULL;
@@ -2851,7 +2877,21 @@ autofdo_source_profile::read ()
        fatal_error (UNKNOWN_LOCATION,
                     "auto-profile contains duplicated function instance %s",
                     afdo_string_table->get_name (s->name ()));
+
+      timestamp_info_map.insert({s->timestamp (),
+                                std::make_pair (s->name (), 0)});
     }
+
+  /* timestamp_info_map is std::map with timestamp as key,
+     so it's already sorted in ascending order wrt timestamps.
+     This loop maps function with lowest timestamp to 1, and so on.
+     In afdo_annotate_cfg, node->tp_first_run is then set to corresponding
+     tp_first_run value.  */
+
+  int tp_first_run = 1;
+  for (auto &p : timestamp_info_map)
+    p.second.second = tp_first_run++;
+
   int hot_frac = param_hot_bb_count_fraction;
   /* Scale up the profile, but leave some bits in case some counts gets
      bigger than sum_max eventually.  */
@@ -4214,6 +4254,17 @@ afdo_annotate_cfg (void)
       = afdo_source_profile->get_function_instance_by_decl (
           current_function_decl);
 
+  /* FIXME: This is a workaround for sourcefile tracking, if afdo_string_table
+     ends up with empty filename or incorrect filename for the function and
+     should be removed once issues with sourcefile tracking get fixed.  */
+  if (s == NULL)
+    for (unsigned i = 0; i < afdo_string_table->filenames ().length (); i++)
+      {
+       s = afdo_source_profile->get_function_instance_by_decl 
(current_function_decl, afdo_string_table->filenames()[i]);
+       if (s)
+         break;
+      }
+
   if (s == NULL)
     {
       if (dump_file)
@@ -4222,7 +4273,8 @@ afdo_annotate_cfg (void)
       /* create_gcov only dumps symbols with some samples in them.
         This means that we get nonempty zero_bbs only if some
         nonzero counts in profile were not matched with statements.  */
-      if (!flag_profile_partial_training)
+      if (!flag_profile_partial_training
+         && !param_auto_profile_reorder_only)
        {
          FOR_ALL_BB_FN (bb, cfun)
            if (bb->count.quality () == GUESSED_LOCAL)
@@ -4232,6 +4284,20 @@ afdo_annotate_cfg (void)
       return;
     }
 
+  if (auto it = timestamp_info_map.find (s->timestamp ());
+      it != timestamp_info_map.end ())
+    {
+      cgraph_node *node = cgraph_node::get (current_function_decl);
+      node->tp_first_run = it->second.second;
+
+      if (dump_file)
+       fprintf (dump_file, "Setting %s->tp_first_run to %d\n",
+                node->asm_name (), node->tp_first_run);
+    }
+
+  if (param_auto_profile_reorder_only)
+    return;
+
   calculate_dominance_info (CDI_POST_DOMINATORS);
   calculate_dominance_info (CDI_DOMINATORS);
   loop_optimizer_init (0);
diff --git a/gcc/ipa-locality-cloning.cc b/gcc/ipa-locality-cloning.cc
index 2684046bd2d..1653ea7b961 100644
--- a/gcc/ipa-locality-cloning.cc
+++ b/gcc/ipa-locality-cloning.cc
@@ -994,6 +994,50 @@ locality_determine_static_order (auto_vec<locality_order 
*> *order)
   return true;
 }
 
+/* Similar to lto-partition.cc:partition_by_timestamp.s  */
+
+static locality_partition
+partition_by_timestamps (int64_t max_partition_size, int& npartitions)
+{
+  vec<cgraph_node *> cnodes = vNULL;
+  cgraph_node *node;
+  FOR_EACH_DEFINED_FUNCTION (node)
+    if (node->tp_first_run > 0
+       && node->get_partitioning_class () == SYMBOL_PARTITION
+       && ipa_size_summaries->get (node))
+      cnodes.safe_push (node);
+  cnodes.qsort (tp_first_run_node_cmp);
+
+  unsigned n_partitioned_symbols = 0;
+
+  locality_partition partition = NULL;
+  unsigned i;
+  FOR_EACH_VEC_ELT (cnodes, i, node)
+    {
+      if (node_partitioned_p (node))
+       continue;
+      ipa_size_summary *size_summary = ipa_size_summaries->get (node);
+
+      if (partition == NULL
+         || partition->insns + size_summary->size > max_partition_size)
+       partition = create_partition (npartitions);
+
+      add_node_to_partition (partition, node);
+      n_partitioned_symbols++;
+    }
+
+  if (dump_file)
+    {
+      fprintf (dump_file,
+              "partition_by_timestamps: Total partitions: %d\n", npartitions);
+      fprintf (dump_file,
+              "partition_by_timestamps: Number of partitioned symbols: %d\n",
+              n_partitioned_symbols);
+    }
+
+  return partition;
+}
+
 /* Partitioning for code locality.
    1. Create and sort callchains.  If PGO is available, use real profile
    counts.  Otherwise, use a set of heuristics to sort the callchains.
@@ -1007,7 +1051,7 @@ locality_partition_and_clone (int 
max_locality_partition_size,
                              lto_locality_cloning_model cloning_model,
                              int freq_denominator, int size)
 {
-  locality_partition partition;
+  locality_partition partition = NULL;
   int npartitions = 0;
 
   auto_vec<locality_order *> order;
@@ -1040,7 +1084,11 @@ locality_partition_and_clone (int 
max_locality_partition_size,
   int64_t partition_size
     = max_locality_partition_size
        ? max_locality_partition_size : param_max_partition_size;
-  partition = create_partition (npartitions);
+
+  if (flag_auto_profile && flag_profile_reorder_functions)
+    partition = partition_by_timestamps (partition_size, npartitions);
+  if (!partition)
+    partition = create_partition (npartitions);
 
   for (unsigned i = 0; i < order.length (); i++)
     {
diff --git a/gcc/lto/lto-partition.cc b/gcc/lto/lto-partition.cc
index 8c6ec6f8338..eb57c357577 100644
--- a/gcc/lto/lto-partition.cc
+++ b/gcc/lto/lto-partition.cc
@@ -97,6 +97,15 @@ create_partition_if_empty ()
     new_partition ("empty");
 }
 
+/* Create and return the created partition of name NAME.  */
+
+static ltrans_partition
+create_partition (int &npartitions, const char *name)
+{
+  npartitions++;
+  return new_partition (name);
+}
+
 /* Free memory used by ltrans partition.
    Encoder can be kept to be freed after streaming.  */
 static void
@@ -1059,6 +1068,50 @@ lto_cache_map (int n_lto_partitions, int 
max_partition_size)
   partitioner.apply (partitions, n_lto_partitions);
 }
 
+/* Place symbols that are profiled in as few partitions as possible.  */
+
+static ltrans_partition
+partition_by_timestamps (int max_partition_size, int &npartitions)
+{
+  vec<cgraph_node *> cnodes = vNULL;
+  cgraph_node *node;
+  FOR_EACH_DEFINED_FUNCTION (node)
+    if (node->tp_first_run > 0
+       && node->get_partitioning_class () == SYMBOL_PARTITION
+       && ipa_size_summaries->get (node))
+      cnodes.safe_push (node);
+  cnodes.qsort (tp_first_run_node_cmp);
+
+  unsigned n_partitioned_symbols = 0;
+
+  ltrans_partition partition = NULL;
+  unsigned i;
+  FOR_EACH_VEC_ELT (cnodes, i, node)
+    {
+      if (symbol_partitioned_p (node))
+       continue;
+
+      ipa_size_summary *size_summary = ipa_size_summaries->get (node);
+      if (partition == NULL
+         || partition->insns + size_summary->size > max_partition_size)
+       partition = create_partition (npartitions, "");
+
+      add_symbol_to_partition (partition, node);
+      n_partitioned_symbols++;
+    }
+
+  if (dump_file)
+    {
+      fprintf (dump_file,
+              "partition_by_timestamps: Total partitions: %d\n", npartitions);
+      fprintf (dump_file,
+              "partition_by_timestamps: Number of partitioned symbols: %d\n",
+              n_partitioned_symbols);
+    }
+
+  return partition;
+}
+
 /* Group cgraph nodes into equally-sized partitions.
 
    The partitioning algorithm is simple: nodes are taken in predefined order.
@@ -1114,7 +1167,7 @@ lto_balanced_map (int n_lto_partitions, int 
max_partition_size)
   int64_t cost = 0, internal = 0;
   unsigned int best_n_nodes = 0, best_i = 0;
   int64_t best_cost = -1, best_internal = 0, best_size = 0;
-  int npartitions;
+  int npartitions = 0;
   int current_order = -1;
   int noreorder_pos = 0;
 
@@ -1134,6 +1187,19 @@ lto_balanced_map (int n_lto_partitions, int 
max_partition_size)
 
   original_total_size = total_size;
 
+  /* With -fauto-profile -fprofile-reorder-functions, place all symbols which
+     are profiled together into as few partitions as possible. The rationale
+     for doing this with AutoFDO is that the number of sampled functions is a
+     fraction of total number of executed functions (unlike PGO where each
+     executed function gets instrumented with time profile counter), and
+     placing them together helps with code locality.  */
+
+  partition = NULL;
+  if (flag_auto_profile
+      && flag_profile_reorder_functions
+      && noreorder.length () == 0)
+    partition = partition_by_timestamps (max_partition_size, npartitions);
+
   /* Streaming works best when the source units do not cross partition
      boundaries much.  This is because importing function from a source
      unit tends to import a lot of global trees defined there.  We should
@@ -1168,8 +1234,12 @@ lto_balanced_map (int n_lto_partitions, int 
max_partition_size)
   partition_size = total_size / n_lto_partitions;
   if (partition_size < param_min_partition_size)
     partition_size = param_min_partition_size;
-  npartitions = 1;
-  partition = new_partition ("");
+
+  if (npartitions == 0)
+    {
+      npartitions = 1;
+      partition = new_partition ("");
+    }
   if (dump_file)
     fprintf (dump_file, "Total unit size: %" PRId64 ", partition size: %" 
PRId64 "\n",
             total_size, partition_size);
@@ -1512,15 +1582,6 @@ add_node_references_to_partition (ltrans_partition 
partition, symtab_node *node)
 
 }
 
-/* Create and return the created partition of name NAME.  */
-
-static ltrans_partition
-create_partition (int &npartitions, const char *name)
-{
-  npartitions++;
-  return new_partition (name);
-}
-
 /* Partitioning for code locality.
    The partitioning plan (and prerequisite cloning) will have been done by the
    IPA locality cloning pass.  This function just implements that plan by
diff --git a/gcc/params.opt b/gcc/params.opt
index ad5ee887367..936c5a2a116 100644
--- a/gcc/params.opt
+++ b/gcc/params.opt
@@ -70,6 +70,10 @@ Enable asan detection of use-after-return bugs.
 Common Joined UInteger Var(param_auto_profile_bbs) Init(1) IntegerRange(0, 1) 
Param Optimization
 Build basic block profile using auto profile.
 
+-param=auto-profile-reorder-only=
+Common Joined UInteger Var(param_auto_profile_reorder_only) Init(0) 
IntegerRange(0, 1) Param Optimization
+Eanble only function reordering with auto-profile.
+
 -param=cycle-accurate-model=
 Common Joined UInteger Var(param_cycle_accurate_model) Init(1) IntegerRange(0, 
1) Param Optimization
 Whether the scheduling description is mostly a cycle-accurate model of the 
target processor and is likely to be spill aggressively to fill any pipeline 
bubbles.
diff --git a/gcc/varasm.cc b/gcc/varasm.cc
index 0d78f5b384f..34673a970ea 100644
--- a/gcc/varasm.cc
+++ b/gcc/varasm.cc
@@ -610,6 +610,15 @@ default_function_section (tree decl, enum node_frequency 
freq,
   if (!flag_reorder_functions
       || !targetm_common.have_named_sections)
     return NULL;
+
+  /* With -fauto-profile -fprofile-reorder-functions, give higher priority
+     to time profile based reordering, and ensure the reordering isn't split
+     by hot/cold partitioning.  */
+  if (flag_auto_profile
+      && flag_profile_reorder_functions
+      && cgraph_node::get (decl)->tp_first_run > 0)
+    return NULL;
+
   /* Startup code should go to startup subsection unless it is
      unlikely executed (this happens especially with function splitting
      where we can split away unnecessary parts of static constructors.  */

Reply via email to