> (1) Changes to auto-profile pass:
> The patch adds a new field timestamp to function_instance,
> and populates it in read_function_instance.
> 
> It maintains a new timestamp_info_map from timestamp -> <name, tp_first_run>,
> which maps timestamps sorted in ascending order to (1..N), so lowest ordered
> timestamp is mapped to 1 and so on. The rationale for this is that
> timestamps are 64-bit integers, and we don't need the full 64-bit range
> for ordering by tp_first_run.
> 
> During annotation, the timestamp associated with function_instance is looked 
> up
> in timestamp_info_map, and corresponding mapped value is assigned
> to node->tp_first_run.

I assume that the autofdo assigns timestamps to offile function
instances, so this can also fit into function_instance datastructure,
but I also suppose you want to keep it separate so it is optional?
> 
> (2) Handling clones:
> Currently, for clones not registered in call graph before auto-profile pass,
> the tp_first_run field is copied from original function, when the clone is 
> created.
> However that may not correspond to the actual order of functions.
> 
> For eg, if we have two profiled clones of foo:
> foo.constprop.1, foo.constprop.2
> 
> both will get same value for tp_first_run as foo->tp_first_run, which might
> not correspond to time profile order.
> 
> To address this, the patch introduces a new IPA pass ipa_adjust_tp_first_run,
> that streams <clone name, tp_first_run> from timestamp_info_map during LGEN, 
> and
> during WPA reads it, and sets clone's tp_first_run field accordingly.
> The pass is placed pretty late (just before locality_cloning), by that
> point clones would be registered in the call graph.

This is also a problem we have with -fprofile-use.   Since for clones we
know who is calling htem, perhaps we can just assign the time based on
minimal first run of calling functions?
(this may be too early in case the caller has a path avoiding callee,
but may be reasonable heurstics)

Can you separate the patch so this goes separately? Copying the
tp_first_run value is not terrible and we can discuss it more after
basic ifrastructure goes in.
> 
> Dhruv's sourcefile tracking patch already handles LTO privatized functions.
> The patch adds a (temporary) workaround for functions with mismatched/empty 
> filenames from gcov,
> to avoid getting dropped in afdo_annotate_cfg by iterating thru all filenames 
> in afdo_string_table if
> get_function_instance_by_decl fails to find function_instance with lbasename 
> (DECL_SOURCE_FILE (decl)).
> 
> (3) Grouping profiled functions together in as few partitions as possible 
> (preferably single).
> The patch places profiled functions in time profile order together in as
> few paritions as possible to get better advantage of code locality.
> Unlike PGO, where every instrumented function gets a time profile counter,
> with AutoFDO, the sampled functions are a fraction of the total executed ones.
> Similarly, in default_function_section, it overrides hot/cold partitioning so
> that grouping of profiled functions isn't disrupted.
> 
> (4) Option to disable profile driven opts.
> The patch adds option -fauto-profile-reorder-only which only enables 
> time-profile reordering with
> AutoFDO (and disables profile driven opts):
> (a) Useful as a debugging aid to isolate regression to either function 
> reordering or profile driven opts.
> (b) For our use case, it's also seemingly useful as a stopgap measure to 
> avoid regressions with AutoFDO
> profile driven opts, due to issues with profile quality obtained with merging 
> of SPE and non SPE profiles.
> We're actively working on resolving this.
> (c) Possibly useful for architectures which do not support branch sampling.
> The option is disabled by default.
> 
> Ideally, I would like to make it a param (and not user facing option), but I 
> am not able to control enabling/disabling options in 
> opts.cc:common_handle_option
> based on param value, will investigate this further.

Yes, I think this should be a param.  I already added option to not read
function profile (only scale existing guessed profile), so perhaps we
can have --param auto-profile-* to control individual auto-profile
features that can be used for debugging until things gets more reliable.

Also does the fature work with normal partitioning? It should make those
function with tp_first_run nonzero go into early partitions, but then we
will likely get relatively lot of calls to functions with zero counts
just because we lost track of them?

I also wonder if offlining can behave meaningfully here.  I.e. when
function profile is non-zero just copy first run from the caller.
> diff --git a/gcc/auto-profile.cc b/gcc/auto-profile.cc
> index ae604762165..58e583d5b87 100644
> --- a/gcc/auto-profile.cc
> +++ b/gcc/auto-profile.cc
> @@ -54,6 +54,8 @@ along with GCC; see the file COPYING3.  If not see
>  #include "tree-pretty-print.h"
>  #include "gimple-pretty-print.h"
>  #include "output.h"
> +#include "data-streamer.h"
> +#include "lto-streamer.h"
>  
>  /* The following routines implements AutoFDO optimization.
>  
> @@ -263,6 +265,8 @@ public:
>  
>    /* Return cgraph node corresponding to given name index.  */
>    cgraph_node *get_cgraph_node (int);
> +
> +  const string_vector& filenames () { return filenames_; }
>  private:
>    typedef std::map<const char *, unsigned, string_compare> string_index_map;
>    typedef std::map<const char *, auto_vec<unsigned>, string_compare>
> @@ -294,7 +298,7 @@ public:
>       too. STACK is used to track the recursive creation process.  */
>    static function_instance *
>    read_function_instance (function_instance_stack *stack,
> -                          gcov_type head_count);
> +                       gcov_type head_count, bool toplevel = true);
>  
>    /* Recursively deallocate all callsites (nested function_instances).  */
>    ~function_instance ();
> @@ -329,6 +333,12 @@ public:
>      return head_count_;
>    }
>  
> +  gcov_type
> +  timestamp () const
> +  {
> +    return timestamp_;
> +  }
> +
>    /* Traverse callsites of the current function_instance to find one at the
>       location of LINENO and callee name represented in DECL.
>       LOCATION should match LINENO and is used to output diagnostics.  */
> @@ -489,11 +499,13 @@ private:
>    /* Map from callsite to callee function_instance.  */
>    typedef std::map<callsite, function_instance *> callsite_map;
>  
> -  function_instance (unsigned name, unsigned file_name, gcov_type head_count)
> +  function_instance (unsigned name, unsigned file_name, gcov_type head_count,
> +                  gcov_type timestamp)
>      : name_ (name), file_name_ (file_name), total_count_ (0),
> -      head_count_ (head_count), removed_icall_target_ (false),
> -      realized_ (false), in_worklist_ (false), inlined_to_ (NULL),
> -      location_ (UNKNOWN_LOCATION), call_location_ (UNKNOWN_LOCATION)
> +      head_count_ (head_count), timestamp_ (timestamp),
> +      removed_icall_target_ (false), realized_ (false), in_worklist_ (false),
> +      inlined_to_ (NULL), location_ (UNKNOWN_LOCATION),
> +      call_location_ (UNKNOWN_LOCATION)
>    {
>    }
>  
> @@ -512,6 +524,9 @@ private:
>    /* Entry BB's sample count.  */
>    gcov_type head_count_;
>  
> +  /* perf timestamp associated with first execution of function.  */
> +  gcov_type timestamp_;
> +
>    /* Map from callsite location to callee function_instance.  */
>    callsite_map callsites;
>  
> @@ -559,7 +574,7 @@ public:
>    ~autofdo_source_profile ();
>  
>    /* For a given DECL, returns the top-level function_instance.  */
> -  function_instance *get_function_instance_by_decl (tree decl) const;
> +  function_instance *get_function_instance_by_decl (tree decl, const char * 
> = NULL) const;
>  
>    /* For a given name index (and original name), returns the top-level
>     * function_instance.  */
> @@ -632,6 +647,13 @@ static autofdo_source_profile *afdo_source_profile;
>  /* gcov_summary structure to store the profile_info.  */
>  static gcov_summary *afdo_profile_info;
>  
> +/* Map from timestamp -> <name, tp_first_run>.
> +
> +The purpose of this map is to map 64-bit timestamp values to (1..N),
> +sorted by ascending order of timestamps and assign that to 
> node->tp_first_run,
> +since we don't need the full 64-bit range.  */
> +static std::map<gcov_type, std::pair<int, int>> timestamp_info_map;
> +
>  /* Scaling factor for afdo data.  Compared to normal profile
>     AFDO profile counts are much lower, depending on sampling
>     frequency.  We scale data up to reudce effects of roundoff
> @@ -2477,7 +2499,7 @@ autofdo_source_profile::offline_unrealized_inlines ()
>         delete f;
>       }
>        /* If function is not in symbol table, remove it.  */
> -      else if (!f->realized_p ())
> +      else if (in_map && !f->realized_p ())
>       {
>         if (dump_file)
>           fprintf (dump_file, "Removing optimized out function %s\n",
> @@ -2502,6 +2524,7 @@ autofdo_source_profile::offline_unrealized_inlines ()
>  /* function instance profile format:
>  
>     ENTRY_COUNT: 8 bytes
> +   TIMESTAMP: 8 bytes (only for toplevel symbols)
>     NAME_INDEX: 4 bytes
>     NUM_POS_COUNTS: 4 bytes
>     NUM_CALLSITES: 4 byte
> @@ -2528,14 +2551,17 @@ autofdo_source_profile::offline_unrealized_inlines ()
>  
>  function_instance *
>  function_instance::read_function_instance (function_instance_stack *stack,
> -                                        gcov_type head_count)
> +                                        gcov_type head_count, bool toplevel)
>  {
> +  gcov_type_unsigned timestamp = 0;
> +  if (toplevel)
> +    timestamp = (gcov_type_unsigned) gcov_read_counter ();
>    unsigned name = gcov_read_unsigned ();
>    unsigned num_pos_counts = gcov_read_unsigned ();
>    unsigned num_callsites = gcov_read_unsigned ();
>    function_instance *s
>      = new function_instance (name, afdo_string_table->get_filename_idx 
> (name),
> -                          head_count);
> +                          head_count, timestamp);
>    if (!stack->is_empty ())
>      s->set_inlined_to (stack->last ());
>    stack->safe_push (s);
> @@ -2563,7 +2589,7 @@ function_instance::read_function_instance 
> (function_instance_stack *stack,
>      {
>        unsigned offset = gcov_read_unsigned ();
>        function_instance *callee_function_instance
> -          = read_function_instance (stack, -1);
> +     = read_function_instance (stack, -1, false);
>        s->callsites[std::make_pair (offset, callee_function_instance->name 
> ())]
>            = callee_function_instance;
>      }
> @@ -2585,9 +2611,11 @@ autofdo_source_profile::~autofdo_source_profile ()
>  /* For a given DECL, returns the top-level function_instance.  */
>  
>  function_instance *
> -autofdo_source_profile::get_function_instance_by_decl (tree decl) const
> +autofdo_source_profile::get_function_instance_by_decl (tree decl, const char 
> *filename) const
>  {
> -  const char *filename = lbasename (DECL_SOURCE_FILE (decl));
> +  if (!filename)
> +    filename = lbasename (DECL_SOURCE_FILE (decl));
> +
>    int index = afdo_string_table->get_index_by_decl (decl);
>    if (index == -1)
>      return NULL;
> @@ -2829,7 +2857,21 @@ autofdo_source_profile::read ()
>       fatal_error (UNKNOWN_LOCATION,
>                    "auto-profile contains duplicated function instance %s",
>                    afdo_string_table->get_name (s->name ()));
> +
> +      timestamp_info_map.insert({s->timestamp (),
> +                              std::make_pair (s->name (), 0)});
>      }
> +
> +  /* timestamp_info_map is std::map with timestamp as key,
> +     so it's already sorted in ascending order wrt timestamps.
> +     This loop maps function with lowest timestamp to 1, and so on.
> +     In afdo_annotate_cfg, node->tp_first_run is then set to corresponding
> +     tp_first_run value.  */
> +
> +  int tp_first_run = 1;
> +  for (auto &p : timestamp_info_map)
> +    p.second.second = tp_first_run++;
> +
>    int hot_frac = param_hot_bb_count_fraction;
>    /* Scale up the profile, but leave some bits in case some counts gets
>       bigger than sum_max eventually.  */
> @@ -4079,6 +4121,17 @@ afdo_annotate_cfg (void)
>        = afdo_source_profile->get_function_instance_by_decl (
>            current_function_decl);
>  
> +  /* FIXME: This is a workaround for sourcefile tracking, if 
> afdo_string_table
> +     ends up with empty filename or incorrect filename for the function and
> +     should be removed once issues with sourcefile tracking get fixed.  */
> +  if (s == NULL)
> +    for (unsigned i = 0; i < afdo_string_table->filenames ().length (); i++)
> +      {
> +     s = afdo_source_profile->get_function_instance_by_decl 
> (current_function_decl, afdo_string_table->filenames()[i]);
> +     if (s)
> +       break;
> +      }
> +
>    if (s == NULL)
>      {
>        if (dump_file)
> @@ -4087,7 +4140,8 @@ afdo_annotate_cfg (void)
>        /* create_gcov only dumps symbols with some samples in them.
>        This means that we get nonempty zero_bbs only if some
>        nonzero counts in profile were not matched with statements.  */
> -      if (!flag_profile_partial_training)
> +      if (!flag_profile_partial_training
> +       && !flag_auto_profile_reorder_only)
>       {
>         FOR_ALL_BB_FN (bb, cfun)
>           if (bb->count.quality () == GUESSED_LOCAL)
> @@ -4097,6 +4151,20 @@ afdo_annotate_cfg (void)
>        return;
>      }
>  
> +  if (auto it = timestamp_info_map.find (s->timestamp ());
> +      it != timestamp_info_map.end ())
> +    {
> +      cgraph_node *node = cgraph_node::get (current_function_decl);
> +      node->tp_first_run = it->second.second;
> +
> +      if (dump_file)
> +     fprintf (dump_file, "Setting %s->tp_first_run to %d\n",
> +              node->asm_name (), node->tp_first_run);
> +    }
> +
> +  if (flag_auto_profile_reorder_only)
> +    return;
> +
>    calculate_dominance_info (CDI_POST_DOMINATORS);
>    calculate_dominance_info (CDI_DOMINATORS);
>    loop_optimizer_init (0);
> @@ -4496,3 +4564,182 @@ make_pass_ipa_auto_profile_offline (gcc::context 
> *ctxt)
>  {
>    return new pass_ipa_auto_profile_offline (ctxt);
>  }
> +
> +/* Map from name -> tp_first_run for clones. We need this for initiailizing 
> tp_first_run
> +   of clones created after auto-profile pass.  */
> +
> +static std::map<const char *, int, autofdo::string_compare> 
> clone_timestamp_map;

This probably should be symbol_summary, so it stays up to date with
symbol table changes?
> +
> +/* Check if name is a clone function name. LTO privatized clones should 
> already
> +   be handled with sourcefile tracking.  */
> +
> +static bool
> +clone_function_p (const char *name)
> +{
> +  return strchr (name, '.') && !strstr (name, "lto_priv");
> +}
> +
> +/* Gather <name, tp_first_run> value from autofdo::timestamp_info_map.  */
> +
> +static void
> +adjust_tp_first_run_generate_summary (void)
> +{
> +  for (const auto &p : autofdo::timestamp_info_map)
> +    {
> +      gcov_type timestamp = p.first;
> +      int name_id = p.second.first;
> +      int tp_first_run = p.second.second;
> +
> +      if (const char *name = autofdo::afdo_string_table->get_name (name_id);
> +       name && clone_function_p (name))
> +     clone_timestamp_map.insert({xstrdup (name), tp_first_run});
> +    }
> +}
> +
> +/* Stream out <name, tp_first_run>.  */
> +
> +static void
> +adjust_tp_first_run_write_summary (void)
> +{
> +  struct output_block *ob
> +    = create_output_block (LTO_section_ipa_adjust_tp_first_run);
> +
> +  streamer_write_uhwi (ob, clone_timestamp_map.size ());
> +  for (const auto &p : clone_timestamp_map)
> +    {
> +      streamer_write_string (ob, ob->main_stream, p.first, true);
> +      streamer_write_hwi (ob, p.second);
> +    }
> +
> +  produce_asm (ob);
> +  destroy_output_block (ob);
> +}
> +
> +/* Stream in <name, tp_first_run> and put it in clone_timestamp_map.  */
> +
> +static void
> +adjust_tp_first_run_read_summary (void)
> +{
> +  struct lto_file_decl_data **file_data_vec = lto_get_file_decl_data ();
> +  struct lto_file_decl_data *file_data;
> +  unsigned j = 0;
> +
> +  while ((file_data = file_data_vec[j++]))
> +    {
> +      size_t len;
> +      const char *data
> +     = lto_get_summary_section_data (file_data,
> +                                     LTO_section_ipa_adjust_tp_first_run,
> +                                     &len);
> +      if (!data)
> +     continue;
> +
> +      const struct lto_function_header *header
> +     = (const struct lto_function_header *) data;
> +      const int cfg_offset = sizeof (struct lto_function_header);
> +      const int main_offset = cfg_offset + header->cfg_size;
> +      const int string_offset = main_offset + header->main_size;
> +      struct data_in *data_in;
> +
> +      lto_input_block ib ((const char *) data + main_offset,
> +                       header->main_size, file_data);
> +
> +      data_in
> +     = lto_data_in_create (file_data, (const char *) data + string_offset,
> +                           header->string_size, vNULL);
> +
> +      unsigned num_clones = streamer_read_uhwi (&ib);
> +      for (unsigned i = 0; i < num_clones; i++)
> +     {
> +       const char *fn_name = streamer_read_string (data_in, &ib);
> +       int tp_first_run = streamer_read_hwi (&ib);
> +       clone_timestamp_map.insert ({xstrdup (fn_name), tp_first_run});
> +     }
> +
> +      lto_free_section_data (file_data, LTO_section_ipa_adjust_tp_first_run,
> +                          NULL, data, len);
> +      lto_data_in_delete (data_in);
> +    }
> +}
> +
> +/* A late IPA pass for handling clones.
> +   During auto-profile pass, for clones that aren't registered in call graph,
> +   stream <name, tp_first_run> from timestamp_info_map during LGEN.
> +   And during WPA, stream back the info, and assign the field value.
> +   The pass is placed sufficiently late (just before locality_cloning), that
> +   it should have clones registered in the call graph.  */
> +
> +static unsigned
> +adjust_tp_first_run ()
> +{
> +  for (const auto &p : clone_timestamp_map)
> +    if (cgraph_node *node
> +     = cgraph_node::get_for_asmname (get_identifier (p.first)))
> +      {
> +     node->tp_first_run = p.second;
> +     if (dump_file)
> +       fprintf (dump_file, "Setting %s->tp_first_run to %d\n",
> +                node->asm_name (), node->tp_first_run);
> +      }
> +
> +  for (const auto &p : clone_timestamp_map)
> +    free ((void *) p.first);
> +  clone_timestamp_map.clear ();
> +
> +  return 0;
> +}
> +
> +namespace {
> +
> +const pass_data pass_data_ipa_adjust_tp_first_run =
> +{
> +  IPA_PASS, /* type */
> +  "adjust-tp_first_run", /* name */
> +  OPTGROUP_NONE, /* optinfo_flags */
> +  TV_CGRAPHOPT, /* tv_id */
> +  0, /* properties_required */
> +  0, /* properties_provided */
> +  0, /* properties_destroyed */
> +  0, /* todo_flags_start */
> +  0, /* todo_flags_finish */
> +};
> +
> +class pass_ipa_adjust_tp_first_run : public ipa_opt_pass_d
> +{
> +public:
> +  pass_ipa_adjust_tp_first_run (gcc::context *ctxt)
> +    : ipa_opt_pass_d (pass_data_ipa_adjust_tp_first_run, ctxt,
> +                   adjust_tp_first_run_generate_summary, /* generate_summary 
> */
> +                   adjust_tp_first_run_write_summary, /* write_summary */
> +                   adjust_tp_first_run_read_summary, /* read_summary */
> +                   NULL, /* write_optimization_summary */
> +                   NULL, /* read_optimization_summary */
> +                   NULL, /* stmt_fixup */
> +                      0, /* function_transform_todo_flags_start */
> +                   NULL, /* function_transform */
> +                   NULL) /* variable_transform */
> +  {}
> +
> +  /* opt_pass methods: */
> +  bool
> +  gate (function *) final override
> +  {
> +    return flag_auto_profile;
> +  }
> +
> +  unsigned
> +  execute (function *) final override
> +  {
> +    return adjust_tp_first_run ();
> +  }
> +
> +}; // class pass_ipa_adjust_tp_first_run
> +
> +} // anon namespace
> +
> +ipa_opt_pass_d *
> +make_pass_ipa_adjust_tp_first_run (gcc::context *ctxt)
> +{
> +  return new pass_ipa_adjust_tp_first_run (ctxt);
> +}
> +
> diff --git a/gcc/common.opt b/gcc/common.opt
> index f6d93dc05fb..d7c028f87fe 100644
> --- a/gcc/common.opt
> +++ b/gcc/common.opt
> @@ -1203,6 +1203,10 @@ fauto-profile-inlining
>  Common Var(flag_auto_profile_inlining) Init(1) Optimization
>  Perform inlining using auto-profile.
>  
> +fauto-profile-reorder-only
> +Common Var(flag_auto_profile_reorder_only) Init(0) Optimization
> +Perform only function based reordering with fauto-profile.
> +
>  ; -fcheck-bounds causes gcc to generate array bounds checks.
>  ; For C, C++ and ObjC: defaults off.
>  ; For Java: defaults to on.
> diff --git a/gcc/ipa-locality-cloning.cc b/gcc/ipa-locality-cloning.cc
> index 2684046bd2d..1653ea7b961 100644
> --- a/gcc/ipa-locality-cloning.cc
> +++ b/gcc/ipa-locality-cloning.cc
> @@ -994,6 +994,50 @@ locality_determine_static_order (auto_vec<locality_order 
> *> *order)
>    return true;
>  }
>  
> +/* Similar to lto-partition.cc:partition_by_timestamp.s  */
> +
> +static locality_partition
> +partition_by_timestamps (int64_t max_partition_size, int& npartitions)
> +{
> +  vec<cgraph_node *> cnodes = vNULL;
> +  cgraph_node *node;
> +  FOR_EACH_DEFINED_FUNCTION (node)
> +    if (node->tp_first_run > 0
> +     && node->get_partitioning_class () == SYMBOL_PARTITION
> +     && ipa_size_summaries->get (node))
> +      cnodes.safe_push (node);
> +  cnodes.qsort (tp_first_run_node_cmp);
> +
> +  unsigned n_partitioned_symbols = 0;
> +
> +  locality_partition partition = NULL;
> +  unsigned i;
> +  FOR_EACH_VEC_ELT (cnodes, i, node)
> +    {
> +      if (node_partitioned_p (node))
> +     continue;
> +      ipa_size_summary *size_summary = ipa_size_summaries->get (node);
> +
> +      if (partition == NULL
> +       || partition->insns + size_summary->size > max_partition_size)
> +     partition = create_partition (npartitions);
> +
> +      add_node_to_partition (partition, node);
> +      n_partitioned_symbols++;
> +    }
> +
> +  if (dump_file)
> +    {
> +      fprintf (dump_file,
> +            "partition_by_timestamps: Total partitions: %d\n", npartitions);
> +      fprintf (dump_file,
> +            "partition_by_timestamps: Number of partitioned symbols: %d\n",
> +            n_partitioned_symbols);
> +    }
> +
> +  return partition;
> +}
> +
>  /* Partitioning for code locality.
>     1. Create and sort callchains.  If PGO is available, use real profile
>     counts.  Otherwise, use a set of heuristics to sort the callchains.
> @@ -1007,7 +1051,7 @@ locality_partition_and_clone (int 
> max_locality_partition_size,
>                             lto_locality_cloning_model cloning_model,
>                             int freq_denominator, int size)
>  {
> -  locality_partition partition;
> +  locality_partition partition = NULL;
>    int npartitions = 0;
>  
>    auto_vec<locality_order *> order;
> @@ -1040,7 +1084,11 @@ locality_partition_and_clone (int 
> max_locality_partition_size,
>    int64_t partition_size
>      = max_locality_partition_size
>       ? max_locality_partition_size : param_max_partition_size;
> -  partition = create_partition (npartitions);
> +
> +  if (flag_auto_profile && flag_profile_reorder_functions)
> +    partition = partition_by_timestamps (partition_size, npartitions);
> +  if (!partition)
> +    partition = create_partition (npartitions);
>  
>    for (unsigned i = 0; i < order.length (); i++)
>      {
> diff --git a/gcc/lto-section-in.cc b/gcc/lto-section-in.cc
> index 1dd9520137a..2e2e30ea400 100644
> --- a/gcc/lto-section-in.cc
> +++ b/gcc/lto-section-in.cc
> @@ -57,6 +57,7 @@ const char *lto_section_name[LTO_N_SECTION_TYPES] =
>    "ipa_sra",
>    "odr_types",
>    "ipa_modref",
> +  "ipa_adjust_tp_first_run",
>  };
>  
>  /* Hooks so that the ipa passes can call into the lto front end to get
> diff --git a/gcc/lto-streamer.h b/gcc/lto-streamer.h
> index 4b7209e3d82..242579918b0 100644
> --- a/gcc/lto-streamer.h
> +++ b/gcc/lto-streamer.h
> @@ -225,6 +225,7 @@ enum lto_section_type
>    LTO_section_ipa_sra,
>    LTO_section_odr_types,
>    LTO_section_ipa_modref,
> +  LTO_section_ipa_adjust_tp_first_run,
>    LTO_N_SECTION_TYPES                /* Must be last.  */
>  };
>  
> diff --git a/gcc/lto/lto-partition.cc b/gcc/lto/lto-partition.cc
> index c7e69ee3161..c7937107ad7 100644
> --- a/gcc/lto/lto-partition.cc
> +++ b/gcc/lto/lto-partition.cc
> @@ -87,6 +87,15 @@ new_partition (const char *name)
>    return part;
>  }
>  
> +/* Create and return the created partition of name NAME.  */
> +
> +static ltrans_partition
> +create_partition (int &npartitions, const char *name)
> +{
> +  npartitions++;
> +  return new_partition (name);
> +}
> +
>  /* Free memory used by ltrans partition.
>     Encoder can be kept to be freed after streaming.  */
>  static void
> @@ -1017,6 +1026,50 @@ lto_cache_map (int n_lto_partitions, int 
> max_partition_size)
>    partitioner.apply (partitions, n_lto_partitions);
>  }
>  
> +/* Place symbols that are profiled in as few partitions as possible.  */
> +
> +static ltrans_partition
> +partition_by_timestamps (int max_partition_size, int &npartitions)
> +{
> +  vec<cgraph_node *> cnodes = vNULL;
> +  cgraph_node *node;
> +  FOR_EACH_DEFINED_FUNCTION (node)
> +    if (node->tp_first_run > 0
> +     && node->get_partitioning_class () == SYMBOL_PARTITION
> +     && ipa_size_summaries->get (node))
> +      cnodes.safe_push (node);
> +  cnodes.qsort (tp_first_run_node_cmp);
> +
> +  unsigned n_partitioned_symbols = 0;
> +
> +  ltrans_partition partition = NULL;
> +  unsigned i;
> +  FOR_EACH_VEC_ELT (cnodes, i, node)
> +    {
> +      if (symbol_partitioned_p (node))
> +     continue;
> +
> +      ipa_size_summary *size_summary = ipa_size_summaries->get (node);
> +      if (partition == NULL
> +       || partition->insns + size_summary->size > max_partition_size)
> +     partition = create_partition (npartitions, "");
> +
> +      add_symbol_to_partition (partition, node);
> +      n_partitioned_symbols++;
> +    }
> +
> +  if (dump_file)
> +    {
> +      fprintf (dump_file,
> +            "partition_by_timestamps: Total partitions: %d\n", npartitions);
> +      fprintf (dump_file,
> +            "partition_by_timestamps: Number of partitioned symbols: %d\n",
> +            n_partitioned_symbols);
> +    }
> +
> +  return partition;
> +}
> +
>  /* Group cgraph nodes into equally-sized partitions.
>  
>     The partitioning algorithm is simple: nodes are taken in predefined order.
> @@ -1072,7 +1125,7 @@ lto_balanced_map (int n_lto_partitions, int 
> max_partition_size)
>    int64_t cost = 0, internal = 0;
>    unsigned int best_n_nodes = 0, best_i = 0;
>    int64_t best_cost = -1, best_internal = 0, best_size = 0;
> -  int npartitions;
> +  int npartitions = 0;
>    int current_order = -1;
>    int noreorder_pos = 0;
>  
> @@ -1092,6 +1145,19 @@ lto_balanced_map (int n_lto_partitions, int 
> max_partition_size)
>  
>    original_total_size = total_size;
>  
> +  /* With -fauto-profile -fprofile-reorder-functions, place all symbols which
> +     are profiled together into as few partitions as possible. The rationale
> +     for doing this with AutoFDO is that the number of sampled functions is a
> +     fraction of total number of executed functions (unlike PGO where each
> +     executed function gets instrumented with time profile counter), and
> +     placing them together helps with code locality.  */
> +
> +  partition = NULL;
> +  if (flag_auto_profile
> +      && flag_profile_reorder_functions
> +      && noreorder.length () == 0)
> +    partition = partition_by_timestamps (max_partition_size, npartitions);
> +
>    /* Streaming works best when the source units do not cross partition
>       boundaries much.  This is because importing function from a source
>       unit tends to import a lot of global trees defined there.  We should
> @@ -1126,8 +1192,12 @@ lto_balanced_map (int n_lto_partitions, int 
> max_partition_size)
>    partition_size = total_size / n_lto_partitions;
>    if (partition_size < param_min_partition_size)
>      partition_size = param_min_partition_size;
> -  npartitions = 1;
> -  partition = new_partition ("");
> +
> +  if (npartitions == 0)
> +    {
> +      npartitions = 1;
> +      partition = new_partition ("");
> +    }
>    if (dump_file)
>      fprintf (dump_file, "Total unit size: %" PRId64 ", partition size: %" 
> PRId64 "\n",
>            total_size, partition_size);
> @@ -1474,15 +1544,6 @@ add_node_references_to_partition (ltrans_partition 
> partition, symtab_node *node)
>  
>  }
>  
> -/* Create and return the created partition of name NAME.  */
> -
> -static ltrans_partition
> -create_partition (int &npartitions, const char *name)
> -{
> -  npartitions++;
> -  return new_partition (name);
> -}
> -
>  /* Partitioning for code locality.
>     The partitioning plan (and prerequisite cloning) will have been done by 
> the
>     IPA locality cloning pass.  This function just implements that plan by
> diff --git a/gcc/opts.cc b/gcc/opts.cc
> index 10ce2c3de33..a5a9700dc08 100644
> --- a/gcc/opts.cc
> +++ b/gcc/opts.cc
> @@ -3179,7 +3179,13 @@ common_handle_option (struct gcc_options *opts,
>        /* No break here - do -fauto-profile processing. */
>        /* FALLTHRU */
>      case OPT_fauto_profile:
> -      enable_fdo_optimizations (opts, opts_set, value, true);
> +      if (opts->x_flag_auto_profile_reorder_only)
> +     {
> +       opts->x_flag_profile_reorder_functions = true;
> +       opts->x_flag_auto_profile_inlining = false;
> +     }
> +      else
> +     enable_fdo_optimizations (opts, opts_set, value, true);
>        SET_OPTION_IF_UNSET (opts, opts_set, flag_profile_correction, value);
>        break;
>  
> diff --git a/gcc/passes.def b/gcc/passes.def
> index 3f828477b68..bd0e80dfa5f 100644
> --- a/gcc/passes.def
> +++ b/gcc/passes.def
> @@ -171,6 +171,7 @@ along with GCC; see the file COPYING3.  If not see
>    NEXT_PASS (pass_ipa_sra);
>    NEXT_PASS (pass_ipa_fn_summary);
>    NEXT_PASS (pass_ipa_inline);
> +  NEXT_PASS (pass_ipa_adjust_tp_first_run);
>    NEXT_PASS (pass_ipa_locality_cloning);
>    NEXT_PASS (pass_ipa_pure_const);
>    NEXT_PASS (pass_ipa_modref);
> diff --git a/gcc/tree-pass.h b/gcc/tree-pass.h
> index 61cec52c624..8544a6bc552 100644
> --- a/gcc/tree-pass.h
> +++ b/gcc/tree-pass.h
> @@ -553,6 +553,7 @@ extern ipa_opt_pass_d *make_pass_ipa_cdtor_merge 
> (gcc::context *ctxt);
>  extern ipa_opt_pass_d *make_pass_ipa_single_use (gcc::context *ctxt);
>  extern ipa_opt_pass_d *make_pass_ipa_comdats (gcc::context *ctxt);
>  extern ipa_opt_pass_d *make_pass_ipa_modref (gcc::context *ctxt);
> +extern ipa_opt_pass_d *make_pass_ipa_adjust_tp_first_run (gcc::context 
> *ctxt);
>  extern ipa_opt_pass_d *make_pass_ipa_locality_cloning (gcc::context *ctxt);
>  
>  extern gimple_opt_pass *make_pass_cleanup_cfg_post_optimizing (gcc::context
> diff --git a/gcc/varasm.cc b/gcc/varasm.cc
> index 0d78f5b384f..34673a970ea 100644
> --- a/gcc/varasm.cc
> +++ b/gcc/varasm.cc
> @@ -610,6 +610,15 @@ default_function_section (tree decl, enum node_frequency 
> freq,
>    if (!flag_reorder_functions
>        || !targetm_common.have_named_sections)
>      return NULL;
> +
> +  /* With -fauto-profile -fprofile-reorder-functions, give higher priority
> +     to time profile based reordering, and ensure the reordering isn't split
> +     by hot/cold partitioning.  */
> +  if (flag_auto_profile
> +      && flag_profile_reorder_functions
> +      && cgraph_node::get (decl)->tp_first_run > 0)
> +    return NULL;
> +
>    /* Startup code should go to startup subsection unless it is
>       unlikely executed (this happens especially with function splitting
>       where we can split away unnecessary parts of static constructors.  */

Reply via email to