> Hi Honza,
>
> The assumption was that streaming the statistics in from the file would be
> lesser overhead than computing them while reading in the counters. If that
> isn't the case, then it is better not to have them.
Computing the cutoff is a linear-time algorithm, so we can do it cheaply
enough while streaming in: push all counts into one large array and use
quickselect. However, we need to stream in all the data, which is inherently
nonlinear (#invocations of compiler * #of samples), and I do not think that
scales well enough.

> Yes, that is a good idea. But as far as I can see, LLVM does read the
> entirety of the file while streaming the profile information (at least
> going off of the SampleProfileReaderBinary::readImpl () implementation).

Andi is right that the current auto-profile reader ignores the fact that the
gcov file format has well defined sections and a mechanism to add new ones
without breaking old tools (so it is the equivalent of LLVM's extBinary; I do
not see a need for two binary formats). I will fix that incrementally.
Currently dump-gcov is confused about the auto-profile "gcov" file, so I
guess the file format is not implemented correctly. We also want the reader
to skip unknown sections & do some sanity checking (see the sketch further
down). Given the current readers we indeed need to bump up the gcov version.

I checked SampleProfileReaderExtBinaryBase::readOneSection. It has
 - SecProfSummary, which is what you added,
 - SecNameTable, which we have,
 - SecLBRProfile, which seems to be the actual profile counts,
 - SecFuncOffsetTable, which seems to be the helper allowing random access to
   function profiles (which we miss but will need to add to scale to very
   large projects),
 - SecFuncMetadata, which records some info about the profile (i.e. whether
   unique names are used),
 - SecProfileSymbolList, which I did not look into,
and the ability to skip "custom" sections.

> I think what we want is something like the SampleProfileReaderExtBinaryBase
> implementation which uses a section header and is designed to be extensible.

Yes, it seems the current code is halfway there. While it attempts to make
the actual file format extensible and compatible with normal gcov tools, the
reader is broken.

> Yeah, I had designed this following the implementation in LLVM which uses
> both hot and cold counts, and potentially calls with different percentiles
> (instead of a global threshold) which requires caching. I wasn't sure if
> there were places in GCC where this is done so this was a bit of an
> open-ended decision.

We have three types of counts:
 - hot: those above the threshold
 - likely never executed: those where we can prove that they are likely not
   executed, i.e. code leading to abort ()
 - rest
We can subdivide "rest", but so far I have not seen good motivation for that.

> > With LTO GCC has analogous functionality in ipa-profile.cc:ipa_profile
> > which computes histogram of all basic block counts in the program
> > weighted by its expected execution counts and computes hot BB cutoff
> > based on param_hot_bb_count_ws_permille.
> > We consider hot counts, cold counts and probably never executed counts
> > (which is 0 count from profile feedback or static estimation).
> >
> > So I would drop --param afdo-profile-summary-cutoff-hotcold_count and
> > use --param hot-bb-count-ws-permille for afdo as well.
>
> The current design streams out 16 hardcoded percentiles to the GCOV file,
> then looks up the closest percentile to the --param. Would it be better then
> to just calculate one percentile (based on hot-bb-count-ws-permille) while
> streaming in the information? This would remove the need to modify the GCOV
> format at all, if the other summary items can also be cheaply calculated.

I think we can go with your changes and incrementally implement the function
offset table.
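
Just to make the cutoff computation concrete: based on hot-bb-count-ws-permille
one can get the threshold in a single pass over the streamed-in counts,
roughly like this (completely untested sketch; the function and parameter
names are made up and this is not the actual ipa-profile code):

#include <vector>
#include <algorithm>
#include <functional>
#include <cstdint>

/* Sketch: given all sample counts seen while streaming in the profile,
   return the count value covering WS_PERMILLE/1000 of the total weight
   (each count weighted by itself, similar to what ipa-profile does for
   basic block counts).  */

static uint64_t
compute_hot_cutoff (std::vector<uint64_t> &counts, int ws_permille)
{
  uint64_t total = 0;
  for (uint64_t c : counts)
    total += c;

  /* Hottest counts first.  For a single cutoff a quickselect-style scan
     (std::nth_element) would avoid the full sort; sorting just keeps the
     sketch short.  */
  std::sort (counts.begin (), counts.end (), std::greater<uint64_t> ());

  uint64_t threshold = total / 1000 * ws_permille;
  uint64_t accumulated = 0;
  for (uint64_t c : counts)
    {
      accumulated += c;
      if (accumulated >= threshold)
        return c;
    }
  return 0;
}

That should match the spirit of the ipa-profile computation without streaming
any summary from the file.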
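
And on skipping unknown sections mentioned above: since every gcov record
already carries a tag and a length, a tolerant reader loop can just sync past
records it does not understand, along these lines (only the gcov-io
primitives and the GCOV_TAG_AFDO_* tags are real; the read_* helpers are
illustrative):

/* Sketch of a tolerant top-level reader loop for the afdo gcov file:
   records with unknown tags are skipped instead of making the reader give
   up, and gcov_sync also recovers from records whose body we only partially
   consumed.  read_string_table and read_function_instances are illustrative
   helpers, not existing code.  */

while (true)
  {
    gcov_unsigned_t tag = gcov_read_unsigned ();
    if (!tag)
      break;
    gcov_unsigned_t length = gcov_read_unsigned ();
    gcov_position_t base = gcov_position ();

    if (tag == GCOV_TAG_AFDO_FILE_NAMES)
      read_string_table ();
    else if (tag == GCOV_TAG_AFDO_FUNCTION)
      read_function_instances ();
    /* Unknown tag: fall through and skip the whole record.  */

    gcov_sync (base, length);
  }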
We need a table that links a symbol name index to all its function instances
(offline and inline). This seems similar to FuncOffsetTable. So the read-in
can be organized by streaming in the symbol name table, deciding what bodies
we need, and then streaming them in while skipping the bodies we do not need.

I tried to measure clang's overhead for reading the whole profile, but did
not manage to convince clang to read any other file format than extbinary.
However, running llvm-profdata merge -o a llvm.profdata takes 1.2s and
llvm-profdata overlap llvm.profdata llvm.profdata takes 0.55s.
llvm-profdata merge with --binary and --extbinary produces the same data, so
I guess the LLVM folks replaced the binary format by extbinary and the binary
readers/writers are now kind of dead. It seems reading the whole profile
takes around 0.2-0.3s, which is still a lot when building small files (and
would get worse with a bigger training run than the one I used).
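
Going back to the offset table above, roughly what I have in mind per entry
(the layout below is only an illustration, not a proposed final format):

#include <cstdint>

/* Sketch of one entry of the function offset table, keyed by the symbol
   name table, so the reader can load the names first, intersect them with
   the symbols of the current translation unit and seek straight to the
   bodies it needs.  Field layout is illustrative only.  */

struct afdo_function_offset
{
  uint32_t name_index;   /* Index into the symbol name table.  */
  uint64_t file_offset;  /* Where the function instance record starts.  */
  uint64_t length;       /* Its size, so unneeded bodies can be skipped.  */
};

A name that also has inlined instances elsewhere could simply get several
entries.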

Honza