> Hi Honza,
> Thanks for the review, please find responses inline below.
>
> > I assume that the autofdo assigns timestamps to offile function
> > instances, so this can also fit into function_instance datastructure,
> > but I also suppose you want to keep it separate so it is optional?
> function_instance does store timestamp, and uses it as a key into
> timestamp_info_map.
> My intent of keeping a separate data-structure timestamp_info_map was to sort
> 64-bit timestamps into ascending order,
> and then map the sorted timestamp values to (1..N), and assign the mapped
> value to node->tp_first_run.
> So we don't need to up size of node->tp_first_run to 64bit (and I guess we
> won't need the full 64-bit range here?)
Normal time profiling works in a way that time is increased only when
new function is executed, so 2^32 is the limit on number of executed
functions in the program which is probably still large enough: if one
does LTO with more than 2^32 symbols we will run out ot DECL_UIDs and
other things, though it may be possible to link such a large program.
At runtime, all computations and streaming happens in 64bit, so updating
to_first_run into 64bit is also alternative, but I guess the compression
is a reasonable approach until we will need to update other
datastructures as well.
> > Also does the fature work with normal partitioning? It should make
> > those function with tp_first_run nonzero go into early partitions, but
> > then we will likely get relatively lot of calls to functions with zero
> > counts just because we lost track of them?
> The results I reported have been with balanced partitioning. I am seeing
> roughly the same numbers on top of locality partitioning too.
Great, happy that this works on both settings.
> >
> > I also wonder if offlining can behave meaningfully here. I.e. when
> > function profile is non-zero just copy first run from the caller.
> IIUC, the purpose of afdo_offline pass is to remove cross module inlining
> from function_instances so AutoFDO annotation
> gets a better mapping from source locations to CFG ?
Yes, old code assumed all relevant inlining to happen at early inline
time via the afdo-inline path. All inlining that did not happen led to
lost profile data. With LTO this becomes quite problematic since a lot
of inlining is cross-module, so we need to take inline instances out and
merge them with existing offline instances in order to get realistic
profile before the link-time IPA-profile pass.
>
> Currently, the patch only assigns timestamps to functions whose outline copy
> is executed during train run (toplevel).
> Would afdo offlining help for instance if a function is inlined into all it's
> callers during train run (thus the outline copy has no timestamp info),
> but during AutoFDO, it may not get inlined into all it's callers ? Currently
> in this case, I guess we lose timestamp info (and thus time profile
> reordering for such a function).
>
> Currently the transform is gated explicitly on -fauto-profile
> -fprofile-reorder-functions (so users can opt-in instead of being enabled by
> default with -fauto-profile).
> Should that be OK for gcc-16 ?
I think we want to do that (also for gcc-16, since this is contained in
autofdo). If we offline a function with a a non-zero
counts in it, we may want to assign the offline copy first run timestamp
derived from the outer function. (Probably something like
timestamp+inline depth).
> Does the patch look OK if bootstrap+test passes (with and without AutoFDO) ?
> gcc/ChangeLog:
> * auto-profile.cc: (string_table::filenames): New method.
> (function_instance::timestamp_): New member.
> (function_instance::timestamp): New accessor for timestamp_ member.
> (autofdo::timestamp_info_map): New std::map.
> (function_instance::function_instance): Add argument timestamp and set
> it to member timestamp_.
> (function_instance::read_function_instance): Adjust prototype and read
> timestamp.
> (autofdo_source_profile::get_function_instance_by_decl): New argument
> filename with default value NULL.
> (autofdo_source_profile::read): Populate timestamp_info_map.
> (afdo_annotate_cfg): Assign node->tp_first_run based on
> timestamp_info_map and bail out of annotation if
> param auto-profile-reorder-only is enabled.
> * params.opt: New param auto-profile-reorder-only.
> * ipa-locality-cloning.cc (partition_by_timestamps): New function.
> (locality_partition_and_clone): Call partition_by_timestamps.
> * varasm.cc (default_function_section): Bail out if -fauto-profile
> -fprofile-reorder-functions is enabled and node->tp_first_run > 0.
>
> gcc/lto/ChangeLog:
> * lto-partition.cc (create_partition): Move higher up in the file.
> (partition_by_timestamps): New function.
> (lto_balanced_map): Call partition_by_timestamps if -fauto-profile
> -fprofile-reorder-functions is passed and noreroder is empty.
OK
> + /* perf timestamp associated with first execution of function. */
Please add comment that tp_first_run is eventually computed using this
value, but we need to watch overflows since it is 32bit.
> + gcov_type timestamp_;
> +/* Map from timestamp -> <name, tp_first_run>.
> +
> +The purpose of this map is to map 64-bit timestamp values to (1..N),
> +sorted by ascending order of timestamps and assign that to
> node->tp_first_run,
> +since we don't need the full 64-bit range. */
Seems the comment is formatted wrongly.
> +static std::map<gcov_type, std::pair<int, int>> timestamp_info_map;
> +
> /* Scaling factor for afdo data. Compared to normal profile
> AFDO profile counts are much lower, depending on sampling
> frequency. We scale data up to reudce effects of roundoff
> @@ -2523,6 +2543,7 @@ autofdo_source_profile::offline_unrealized_inlines ()
> /* function instance profile format:
>
> ENTRY_COUNT: 8 bytes
> + TIMESTAMP: 8 bytes (only for toplevel symbols)
> NAME_INDEX: 4 bytes
> NUM_POS_COUNTS: 4 bytes
> NUM_CALLSITES: 4 byte
> @@ -2549,14 +2570,17 @@ autofdo_source_profile::offline_unrealized_inlines ()
>
> function_instance *
> function_instance::read_function_instance (function_instance_stack *stack,
> - gcov_type head_count)
> + gcov_type head_count, bool toplevel)
There are also head count that is also streamed for toplevel instances
that is handled by passing it around as a parameter:
function_instance *s = function_instance::read_function_instance (
&stack, gcov_read_counter ());
I think "toplevel" flag is better, so please modify streaming of head
count to work same way.
> = new function_instance (name, afdo_string_table->get_filename_idx
> (name),
> - head_count);
> + head_count, timestamp);
We will need setter api anyway to handle update while offlining, so
perhaps instead of adding extra parameter of the ctor, just set
timestamp after contruction?
> + /* timestamp_info_map is std::map with timestamp as key,
> + so it's already sorted in ascending order wrt timestamps.
> + This loop maps function with lowest timestamp to 1, and so on.
> + In afdo_annotate_cfg, node->tp_first_run is then set to corresponding
> + tp_first_run value. */
> +
> + int tp_first_run = 1;
> + for (auto &p : timestamp_info_map)
> + p.second.second = tp_first_run++;
For time being I guess you can walk inline instances here recursively
and assign tp_first_runs to all of those that contains non-zero counts.
Then offlining only needs to merge the timestamps at a time it merges
the profile.
> diff --git a/gcc/ipa-locality-cloning.cc b/gcc/ipa-locality-cloning.cc
> index 2684046bd2d..1653ea7b961 100644
> --- a/gcc/ipa-locality-cloning.cc
> +++ b/gcc/ipa-locality-cloning.cc
The changes above are OK with the changes mentioned.
The changes below can still be a separate patch (once tp_first_run is
set, we should get sane partitioning and ordering by existing balanced
partitioning code).
> @@ -994,6 +994,50 @@ locality_determine_static_order (auto_vec<locality_order
> *> *order)
> return true;
> }
>
> +/* Similar to lto-partition.cc:partition_by_timestamp.s */
> +
> +static locality_partition
> +partition_by_timestamps (int64_t max_partition_size, int& npartitions)
It is not called timestamps anymore, so perhaps parittion_by_tp_first_run.
> @@ -1134,6 +1187,19 @@ lto_balanced_map (int n_lto_partitions, int
> max_partition_size)
>
> original_total_size = total_size;
>
> + /* With -fauto-profile -fprofile-reorder-functions, place all symbols which
> + are profiled together into as few partitions as possible. The rationale
> + for doing this with AutoFDO is that the number of sampled functions is a
> + fraction of total number of executed functions (unlike PGO where each
> + executed function gets instrumented with time profile counter), and
> + placing them together helps with code locality. */
> +
> + partition = NULL;
> + if (flag_auto_profile
> + && flag_profile_reorder_functions
> + && noreorder.length () == 0)
> + partition = partition_by_timestamps (max_partition_size, npartitions);
I do not see why this is needed. Balanced parititioning starts by a
fixed order of functions:
order.qsort (tp_first_run_node_cmp);
which will prioritize functions with tp_first_run and
flag_profile_reodr_functions enabled (this flag is per-function so it
needs to be checked on each of them)
int
tp_first_run_node_cmp (const void *pa, const void *pb)
{
const cgraph_node *a = *(const cgraph_node * const *) pa;
const cgraph_node *b = *(const cgraph_node * const *) pb;
unsigned int tp_first_run_a = a->tp_first_run;
unsigned int tp_first_run_b = b->tp_first_run;
if (!opt_for_fn (a->decl, flag_profile_reorder_functions)
|| a->no_reorder)
tp_first_run_a = 0;
if (!opt_for_fn (b->decl, flag_profile_reorder_functions)
|| b->no_reorder)
tp_first_run_b = 0;
if (tp_first_run_a == tp_first_run_b)
return a->order - b->order;
/* Functions with time profile must be before these without profile. */
tp_first_run_a = (tp_first_run_a - 1) & INT_MAX;
tp_first_run_b = (tp_first_run_b - 1) & INT_MAX;
return tp_first_run_a - tp_first_run_b;
}
Once order is set, balanced partitioning assigns functions to partitiong
in that order only trying to minimize the interface between the parts by
looking for better cut points in a given range.
So it should give similar or better order than what you get with first
putting non-zero tp first run into partitions using separate
partitioner.
> +
> + /* With -fauto-profile -fprofile-reorder-functions, give higher priority
> + to time profile based reordering, and ensure the reordering isn't split
> + by hot/cold partitioning. */
> + if (flag_auto_profile
> + && flag_profile_reorder_functions
> + && cgraph_node::get (decl)->tp_first_run > 0)
> + return NULL;
I am also not sure why this would be needed. If you have cold function
that is run early (such as startup initialization code) it still makes
sense to place it away from code that is run often.
Honza