> Hi Honza,
>
> The assumption was that streaming the statistics in from the file would be
> lesser overhead than computing them while reading in the counters. If that
> isn't the case, then it is better not to have them.
Computing the cutoff is a linear-time algorithm, so we can do it cheaply
enough while streaming in: push all counts into one large array and use
quickselect. However, we need to stream in all the data, which is inherently
nonlinear (#invocations of compiler * #of samples), and I do not think that
scales well enough.

> Yes, that is a good idea. But as far as I can see, LLVM does read the
> entirety of the file while streaming the profile information (at least
> going off of the SampleProfileReaderBinary::readImpl () implementation).

Andi is right that the current auto-profile reader ignores the fact that the
gcov file format has well defined sections and a mechanism to add new ones
without breaking old tools (so it is the equivalent of LLVM's extBinary; I do
not see a need for two binary formats). I will fix that incrementally.
Currently dump-gcov is confused about the auto-profile "gcov" file, so I
guess the file format is not implemented correctly. We also want the reader
to skip unknown sections & do some sanity checking (see the sketch further
down). Given the current readers we indeed need to bump up the gcov version.

I checked SampleProfileReaderExtBinaryBase::readOneSection. It has
 - SecProfSummary, which is what you added,
 - SecNameTable, which we have,
 - SecLBRProfile, which seems to be the actual profile counts,
 - SecFuncOffsetTable, which seems to be the helper allowing random access to
   function profiles (which we miss but will need to add to scale to very
   large projects),
 - SecFuncMetadata, which records some info about the profile (i.e. whether
   unique names are used),
 - SecProfileSymbolList, which I did not look into,
and the ability to skip "custom" sections.

> I think what we want is something like the SampleProfileReaderExtBinaryBase
> implementation which uses a section header and is designed to be extensible.

Yes, it seems the current code is halfway there. While it attempts to make
the actual file format extensible and compatible with normal gcov tools, the
reader is broken.

> Yeah, I had designed this following the implementation in LLVM which uses
> both hot and cold counts, and potentially calls with different percentiles
> (instead of a global threshold) which requires caching. I wasn't sure if
> there were places in GCC where this is done so this was a bit of an
> open-ended decision.

We have three types of counts:
 - hot: those above the threshold
 - likely never executed: those where we can prove that they are likely not
   executed, i.e. code leading to abort ()
 - rest
We can subdivide "rest", but so far I have not seen good motivation for that.

> > With LTO GCC has analogous functionality in ipa-profile.cc:ipa_profile
> > which computes histogram of all basic block counts in the program
> > weighted by its expected execution counts and computes hot BB cutoff
> > based on param_hot_bb_count_ws_permille.
> > We consider hot counts, cold counts and probably never executed counts
> > (which is 0 count from profile feedback or static estimation).
> >
> > So I would drop --param afdo-profile-summary-cutoff-hotcold_count and
> > use --param hot-bb-count-ws-permille for afdo as well.
>
> The current design streams out 16 hardcoded percentiles to the GCOV file,
> then looks up the closest percentile to the --param. Would it be better then
> to just calculate one percentile (based on hot-bb-count-ws-permille) while
> streaming in the information? This would remove the need to modify the GCOV
> format at all, if the other summary items can also be cheaply calculated.

I think we can go with your changes and incrementally implement the function
offset table.
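
Just to make the cutoff computation concrete: based on hot-bb-count-ws-permille
one can get the threshold in a single pass over the streamed-in counts,
roughly like this (completely untested sketch; the function and parameter
names are made up and this is not the actual ipa-profile code):

#include <vector>
#include <algorithm>
#include <functional>
#include <cstdint>

/* Sketch: given all sample counts seen while streaming in the profile,
   return the count value covering WS_PERMILLE/1000 of the total weight
   (each count weighted by itself, similar to what ipa-profile does for
   basic block counts).  */

static uint64_t
compute_hot_cutoff (std::vector<uint64_t> &counts, int ws_permille)
{
  uint64_t total = 0;
  for (uint64_t c : counts)
    total += c;

  /* Hottest counts first.  For a single cutoff a quickselect-style scan
     (std::nth_element) would avoid the full sort; sorting just keeps the
     sketch short.  */
  std::sort (counts.begin (), counts.end (), std::greater<uint64_t> ());

  uint64_t threshold = total / 1000 * ws_permille;
  uint64_t accumulated = 0;
  for (uint64_t c : counts)
    {
      accumulated += c;
      if (accumulated >= threshold)
        return c;
    }
  return 0;
}

That should match the spirit of the ipa-profile computation without streaming
any summary from the file.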
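
And on skipping unknown sections mentioned above: since every gcov record
already carries a tag and a length, a tolerant reader loop can just sync past
records it does not understand, along these lines (only the gcov-io
primitives and the GCOV_TAG_AFDO_* tags are real; the read_* helpers are
illustrative):

/* Sketch of a tolerant top-level reader loop for the afdo gcov file:
   records with unknown tags are skipped instead of making the reader give
   up, and gcov_sync also recovers from records whose body we only partially
   consumed.  read_string_table and read_function_instances are illustrative
   helpers, not existing code.  */

while (true)
  {
    gcov_unsigned_t tag = gcov_read_unsigned ();
    if (!tag)
      break;
    gcov_unsigned_t length = gcov_read_unsigned ();
    gcov_position_t base = gcov_position ();

    if (tag == GCOV_TAG_AFDO_FILE_NAMES)
      read_string_table ();
    else if (tag == GCOV_TAG_AFDO_FUNCTION)
      read_function_instances ();
    /* Unknown tag: fall through and skip the whole record.  */

    gcov_sync (base, length);
  }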
We need a table that links a symbol name index to all its function instances
(offline and inline). This seems similar to FuncOffsetTable. So the read-in
can be organized by streaming in the symbol name table, deciding what bodies
we need, and then streaming them in while skipping the bodies we do not need.

I tried to measure clang's overhead for reading the whole profile, but did
not manage to convince clang to read any other file format than extbinary.
However, running llvm-profdata merge -o a llvm.profdata takes 1.2s and
llvm-profdata overlap llvm.profdata llvm.profdata takes 0.55s.
llvm-profdata merge with --binary and --extbinary produces the same data, so
I guess the LLVM folks replaced the binary format by extbinary and the binary
readers/writers are now kind of dead. It seems reading the whole profile
takes around 0.2-0.3s, which is still a lot when building small files (and
would get worse with a bigger training run than the one I used).
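
Going back to the offset table above, roughly what I have in mind per entry
(the layout below is only an illustration, not a proposed final format):

#include <cstdint>

/* Sketch of one entry of the function offset table, keyed by the symbol
   name table, so the reader can load the names first, intersect them with
   the symbols of the current translation unit and seek straight to the
   bodies it needs.  Field layout is illustrative only.  */

struct afdo_function_offset
{
  uint32_t name_index;   /* Index into the symbol name table.  */
  uint64_t file_offset;  /* Where the function instance record starts.  */
  uint64_t length;       /* Its size, so unneeded bodies can be skipped.  */
};

A name that also has inlined instances elsewhere could simply get several
entries.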

Honza