Re: [PATCH] [RFC][AutoFDO] Implement summary information in auto-profile

Dhruv Chawla Fri, 17 Oct 2025 22:51:01 -0700

On 13/10/25 14:09, Jan Hubicka wrote:

External email: Use caution opening links or attachments



Hello.

+/* Store the summary information for the profile.  */
+struct summary_info
+{
+  struct detailed_summary
+  {
+    uint32_t cutoff;
+    uint64_t min_count;
+    uint64_t num_counts;
+  };
+
+  uint64_t total_count;
+  uint64_t max_count;
+  uint64_t max_function_count;
+  uint64_t num_counts;
+  uint64_t num_functions;
+  auto_vec<detailed_summary> detailed_summaries;


The summary definitely definitely needs a description of all the fields.
If I get it right, the vector of detailed summary is a histogram of
counts present in the profile, while the rest of structure contains
simple statistics of them.

In the present form, I am not quite sure what is buys in practice, since
the file format is organized in a way that GCC needs to read all the
counters each invocation and thus it can also compute the statistics
during stream-in.


Hi Honza,

The assumption was that streaming the statistics in from the file would be
lesser overhead than computing them while reading in the counters. If that
isn't the case, then it is better not to have them.


There is important streaming overhead. AFDO profile of clang compiling
tramp3d (so very small train run) is 82MB when using create_gcov based
on GCC built binary and 168MB when using llvm's equivalent (building
clang by clang and using create_llvm_prof)

Now I get
  - 0.003795 second per invocation of gcc -S t.c
  - 0.946906 second per invocation of gcc -S t.c -fauto-profile=llvm.gcov
  - 0.011225 second per invocation of clang -S t.c
  - 0.012016 second per invocation of clang -S t.c 
-fprofile-instr-use=llvm.profdata

1s per compiler invocation is really bad and slows down real world build
quite dramatically. While datastructures are bad and can be fixed, clang
is clearly not reading whole file each invocation.

So we will need to add a section linking symbol names to offsets of
their instances and allow random access to the file.  With this
streaming summary makes sense.


Yes, that is a good idea. But as far as I can see, LLVM does read the entirety 
of
the file while streaming the profile information (at least going off of the
SampleProfileReaderBinary::readImpl () implementation).

I think what we want is something like the SampleProfileReaderExtBinaryBase
implementation which uses a section header and is designed to be extensible.

+  std::map<uint32_t, unsigned> percentile_idx_map;

This seems to be there only to cache get_threshold_count results which
is called twice

+uint64_t
+min_afdo_hot_count (uint32_t cutoff)
+{
+  return autofdo::afdo_summary_info->get_threshold_count (cutoff);
+}
+
+/* Get the maximum count needed to be considered cold for the percentile.  */
+
+uint64_t
+max_afdo_cold_count (uint32_t cutoff)
+{
+  return autofdo::afdo_summary_info->get_threshold_count (cutoff);
+}

in these two functions where only first one is invoked.


Yeah, I had designed this following the implementation in LLVM which uses both
hot and cold counts, and potentially calls with different percentiles (instead 
of a
global threshold) which requires caching. I wasn't sure if there were places in
GCC where this is done so this was a bit of an open-ended decision.


With LTO GCC has analogous functionality in ipa-profile.cc:ipa_profile
which computes histogram of all basic block counts in the program
weighted by its expected execution counts and computes hot BB cutoff
based on param_hot_bb_count_ws_permille.
We consider hot counts, cold counts and probably never executed counts
(which is 0 count from porfile feedback or static estimation).

So I would drop --param afdo-profile-summary-cutoff-hotcold_count and
use --param hot-bb-count-ws-permille for afdo as well.


The current design streams out 16 hardcoded percentiles to the GCOV file,
then looks up the closest percentile to the --param. Would it be better then
to just calculate one percentile (based on hot-bb-count-ws-permille) while
streaming in the information? This would remove the need to modify the GCOV
format at all, if the other summary items can also be cheaply calculated.

+
  namespace
  {

diff --git a/gcc/auto-profile.h b/gcc/auto-profile.h
index 3cce5f2152c..66ef20419e8 100644
--- a/gcc/auto-profile.h
+++ b/gcc/auto-profile.h
@@ -42,4 +42,10 @@ extern gcov_type afdo_hot_bb_threshold;
  /* Return true if COUNT is possibly hot.  */
  extern bool maybe_hot_afdo_count_p (profile_count count);

+/* Get the minimum count needed to achieve the hot threshold percentile.  */
+extern uint64_t min_afdo_hot_count (uint32_t cutoff);
+
+/* Get the maximum count needed to be considered cold for the percentile.  */
+extern uint64_t max_afdo_cold_count (uint32_t cutoff);


I do not think we need these two functions. Auto-profile already sets
hot count threshold so it can use --param hot-bb-count-ws-permille
instead of current heuristics.

+
  #endif /* AUTO_PROFILE_H */
diff --git a/gcc/c/Make-lang.in b/gcc/c/Make-lang.in
index f09fc99467b..95a7dad8350 100644
--- a/gcc/c/Make-lang.in
+++ b/gcc/c/Make-lang.in
@@ -102,7 +102,7 @@ create_fdas_for_cc1: ../stage1-gcc/cc1$(exeext) 
../prev-gcc/$(PERF_DATA)
         echo $$perf_path; \
         if [ -f $$perf_path ]; then \
           profile_name=cc1_$$component_in_prev.fda; \
-         $(CREATE_GCOV) -binary ../stage1-gcc/cc1$(exeext) -gcov 
$$profile_name -profile $$perf_path -gcov_version 2 || exit 1; \
+         $(CREATE_GCOV) -binary ../stage1-gcc/cc1$(exeext) -gcov 
$$profile_name -profile $$perf_path -gcov_version 3 || exit 1; \


While we want to bump up gcov version to represent 32bit discriminators,
the gcov file format is intended to be extensible. Since you are just
adding new section I think everything should just work if create_gcov is
updated to add it to version 2.  (i.e. older GCC versions should ignore
it). Does that work?  In the past we did not really need it to work
since gcov format is closely tied with the GCC version (and choice of
early opts) but now I think it is quite useful to get this working since
autofdo profiles are indended to be stable wrt GCC or source code
changes.

Rest of patch looks OK, but can you please drop the params and version
bump and verify that things works as expected?


As Andi pointed out, the auto-profile pass is quite unsophisticated in the
way it streams information in from the file and it does not know how to
ignore a section of the file. I think the only way to get this working is
to put the summary section at the end of the file, then it will get ignored
automatically (which is probably the only way to avoid the version bump).


Thanks,
Honza


--
Regards,
Dhruv

Re: [PATCH] [RFC][AutoFDO] Implement summary information in auto-profile

Reply via email to