> On 2025-06-06 12:42, Jan Hubicka wrote:
> > > Hi,
> > > also after fixing this issue my bootstrap fails with:
> > > 
> > > Permission error mapping pages.
> > > Consider increasing /proc/sys/kernel/perf_event_mlock_kb,
> > > or try again with a smaller value of -m/--mmap_pages.
> > > (current value: 4294967295,0)
> > > Permission error mapping pages.
> > > Consider increasing /proc/sys/kernel/perf_event_mlock_kb,
> > > or try again with a smaller value of -m/--mmap_pages.
> > > (current value: 4294967295,0)
> > > Permission error mapping pages.
> > > Consider increasing /proc/sys/kernel/perf_event_mlock_kb,
> > > or try again with a smaller value of -m/--mmap_pages.
> > > (current value: 4294967295,0)
> > > 
> > > this happens on my setup when perf is run multiple times in parallel
> > > and bootstrapping without -j256 indeed continues (but slowly).
> > > Since perf record overwrites the previous perf data, I wonder if this
> > > also implies that we lose the profile info due to race conditions?
> 
> There should be only one per obj directory?

Hmm, next time I run autoprofiledbootstrap I will check how many perf
invocations there are.

BTW, why does that error happen?
> 
> > One extra nit I noticed.  We use
> > AUTO_PROFILE = gcc-auto-profile --all -c 10000000
> > 
> > I do not see a reason to also profile kernel (i.e. --all flag).
> 
> I thought it was to work around some issue but can't recall details. We
> should try I suppose. Also really need to enable all languages. Autofdo
> bootstrap has various strange workarounds that are likely obsolete.
> 
> > Also the count should probably be a prime so we reduce the chance of
> > getting interference with a loop?
> 
> On Intel with PEBS it's +1 anyways, so more prime like.
> 
> > 
> > I wonder if we can rely on perf having reasonable defaults these days.
> 
> We already do. What's missing is that the latest Intel cores have an
> architectural lbr inserts events that is especially for this and real hybrid
> CPU support.
> 
> > For some CPUs we measure cycles in other cases taken branches.
> 
> Cycles is always a bad idea, you really want branches.

I do 
        if perf list br_inst_retired | grep -q br_inst_retired.near_taken ; then
            E=br_inst_retired.near_taken:p
        elif perf list ex_ret_brn_tkn | grep -q ex_ret_brn_tkn ; then
            E=ex_ret_brn_tkn:P$FLAGS
        elif 
in gcc-auto-profile for Zen3+.
I hope ex_ret_brn_tkn:P is equivalent to Intel's
br_inst_retired.near_taken:p.

However aarch64/gcc-auto-profile does not seem to select any event, so I
think it counts cycles?
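
A minimal sketch of the fallback order described above, written as a pure
function so it can be checked without running perf.  The function name
`pick_branch_event` and the `available` set are hypothetical; the two event
names come from the snippet above.  The `$FLAGS` suffix is left out because
its value is not shown, and the truncated third branch is not reproduced.

```python
def pick_branch_event(available):
    """Return a taken-branch event spec, or None if neither known event exists.

    `available` is assumed to be the set of event names the local perf
    reports (for example, collected from `perf list` output by the caller).
    """
    if "br_inst_retired.near_taken" in available:
        return "br_inst_retired.near_taken:p"   # Intel event from the snippet
    if "ex_ret_brn_tkn" in available:
        return "ex_ret_brn_tkn:P"               # AMD event from the snippet
    # No known taken-branch event: the caller has to decide what to do.
    return None


print(pick_branch_event({"ex_ret_brn_tkn", "cycles"}))
```

A `None` result corresponds to the aarch64 situation mentioned above, where
no event is selected explicitly.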

I do get various complaints out of create_gcov:
[WARNING:/home/jh/autofdo/third_party/perf_data_converter/src/quipper/perf_reader.cc:1322]
 Skipping 4228 bytes of metadata: HEADER_CPU_TOPOLOGY
[WARNING:/home/jh/autofdo/third_party/perf_data_converter/src/quipper/perf_reader.cc:1069]
 Skipping unsupported event PERF_RECORD_ID_INDEX
[WARNING:/home/jh/autofdo/third_party/perf_data_converter/src/quipper/perf_reader.cc:1069]
 Skipping unsupported event PERF_RECORD_EVENT_UPDATE
[WARNING:/home/jh/autofdo/third_party/perf_data_converter/src/quipper/perf_reader.cc:1069]
 Skipping unsupported event PERF_RECORD_CPU_MAP
[WARNING:/home/jh/autofdo/third_party/perf_data_converter/src/quipper/perf_reader.cc:1069]
 Skipping unsupported event UNKNOWN_EVENT_82
[INFO:/home/jh/autofdo/third_party/perf_data_converter/src/quipper/perf_reader.cc:1060]
 Number of events stored: 398
[INFO:/home/jh/autofdo/third_party/perf_data_converter/src/quipper/perf_parser.cc:272]
 Parser processed: 7 MMAP/MMAP2 events, 2 COMM events, 0 FORK events, 1 EXIT 
events, 385 SAMPLE events, 370 of these were mapped, 0 SAMPLE events with a 
data address, 0 of these were mapped
WARNING: Logging before InitGoogleLogging() is written to STDERR     
I20250605 21:21:32.673207 629982 sample_reader.cc:289] No buildid found in 
binary
W20250605 21:21:32.673300 629982 sample_reader.cc:345] Bogus LBR data (range is 
negative): 7eaa->0 index=9
W20250605 21:21:32.673305 629982 sample_reader.cc:345] Bogus LBR data (range is 
negative): 7eaa->0 index=5
W20250605 21:21:32.673328 629982 sample_reader.cc:345] Bogus LBR data (range is 
negative): 7300->0 index=c
W20250605 21:21:32.673385 629982 sample_reader.cc:345] Bogus LBR data (range is 
negative): 6d70->0 index=f
W20250605 21:21:32.673389 629982 sample_reader.cc:345] Bogus LBR data (range is 
negative): 6d70->0 index=6
W20250605 21:21:32.673393 629982 sample_reader.cc:345] Bogus LBR data (range is 
negative): 6d70->0 index=d
I20250605 21:21:32.673444 629982 symbol_map.cc:477] Adding loadable exec 
segment: offset=1000 vaddr=401000

It is chatty.  I especially like the segfault if you give no parameters
to profile_merger.
> 
> > taken branches 10000000 may be a bit high?
> 
> It's rather low. Otherwise the files get really big. Really should use a
> pipe to avoid the temporary files but the standard autofdo toolkit insists
> on loading everything into memory.

I suppose it also depends on what precision we want.  I was checking the
IPA dump.  It builds a histogram and determines the 99% cutoff.  It gets 44,
which is quite a low count.

However, this logic may also be somewhat broken, since create_gcov itself
cuts off at about 98% if I recall correctly, so these two passes likely
add to each other and we get approx. 97%.
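
To make the arithmetic concrete: if one pass keeps the top 98% of the
profile weight and a second pass keeps the top 99% of what is left, the
combined coverage is the product of the two.  The sketch below also shows
one simple way a histogram cutoff can be computed.  It is an illustration
only, not the code GCC or create_gcov actually uses; `hot_cutoff` and the
sample counts are made up.

```python
def hot_cutoff(counts, fraction):
    """Smallest count c such that entries with count >= c cover
    at least `fraction` of the total weight (illustrative only)."""
    total = sum(counts)
    covered = 0
    for c in sorted(counts, reverse=True):
        covered += c
        if covered >= fraction * total:
            return c
    return 0


# Two cutoffs applied one after the other multiply.
combined = 0.98 * 0.99
print(f"combined coverage: {combined:.4f}")

# Made-up counts; the 99% cutoff lands on a small count.
counts = [1000, 500, 300, 100, 60, 30, 6, 4]
print(hot_cutoff(counts, 0.99))
```

With few samples the tail of the histogram is made of small counts, so the
cutoff value itself ends up small.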

So I guess we can drop --all -c 10000000 and try to get it working with
default values on Intel, AMD and arm?

We are trying to set up SPEC benchmarking so the code gets tested.
In this case the train runs are quite short-lived. What -c would be
reasonable?
Should it be prime or prime-1? :)
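
A toy model of why the period matters, assuming an idealized loop that
executes a fixed number of taken branches per iteration and a sampler that
fires exactly every `period` taken branches.  Real hardware has skid and
other effects (and, as noted above, PEBS adds 1), so this is only a sketch;
all names below are hypothetical.

```python
from collections import Counter


def is_prime(n):
    """Trial division; fine for values around 10**7."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    d = 3
    while d * d <= n:
        if n % d == 0:
            return False
        d += 2
    return True


def next_prime(n):
    """Smallest prime >= n."""
    while not is_prime(n):
        n += 1
    return n


def sampled_slots(period, branches_per_iter, samples):
    """Count which branch of the loop body each sample lands on."""
    return Counter((k * period) % branches_per_iter
                   for k in range(1, samples + 1))


# 10000000 is a multiple of 4, so every sample hits the same branch.
print(sampled_slots(10_000_000, 4, 1000))
# A prime period is coprime to 4, so samples spread over all four branches.
print(sampled_slots(next_prime(10_000_000), 4, 1000))
```

In this model any period coprime to the loop's branch count spreads the
samples evenly; a prime period is simply coprime to every smaller loop
length.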
Honza

> 
> Andi
> 
> > 
> > Honza
> > > 
> > > Honza
