> -----Original Message-----
> From: Prathamesh Kulkarni <[email protected]>
> Sent: 23 October 2025 10:39
> To: [email protected]; Jan Hubicka <[email protected]>
> Subject: RE: [RFC] Enable time profile function reordering with
> AutoFDO
>
>
> > -----Original Message-----
> > From: Prathamesh Kulkarni <[email protected]>
> > Sent: 13 October 2025 20:25
> > To: Prathamesh Kulkarni <[email protected]>; gcc-
> > [email protected]; Jan Hubicka <[email protected]>
> > Subject: RE: [RFC] Enable time profile function reordering with
> > AutoFDO
> >
> >
> >
> > > -----Original Message-----
> > > From: Prathamesh Kulkarni <[email protected]>
> > > Sent: 06 October 2025 19:41
> > > To: [email protected]; Jan Hubicka <[email protected]>
> > > Subject: [RFC] Enable time profile function reordering with
> AutoFDO
> > >
> > >
> > > Hi Honza,
> > > The attached patch enables time profile based reordering with
> > AutoFDO
> > > with -fauto-profile -fprofile-reorder-functions, by mapping
> > timestamps
> > > obtained from perf into node->tp_first_run, and is based on top of
> > > Dhruv's sourcefile tracking patch:
> > > https://gcc.gnu.org/pipermail/gcc-patches/2025-September/694800.html
> > >
> > > The rationale for doing this is:
> > > (1) GCC already implements time-profile function reordering; the
> > > patch enables it with AutoFDO.
> > > (2) While time profile ordering is primarily meant for optimizing
> > > startup time, we've also observed good effects on code-locality
> for
> > > large internal workloads.
> > > (3) Possibly useful for function reordering when accurate profile
> > > annotation is hard with AutoFDO, for example if branch samples are
> > > missing (due to the absence of an LBR-like structure).
> > >
> > > On the AutoFDO tools side, I have a patch that extends gcov to emit
> > > a 64-bit perf timestamp that records the first execution of a
> > > function, which loosely corresponds to PGO's time_profile counter.
> > > The timestamp is stored adjacent to the head field in the toplevel
> > > function info.
> > > I will post a patch for this shortly on the AutoFDO tools upstream
> > > repo.
> > >
> > > On GCC side, the patch makes the following changes:
> > >
> > > (1) Changes to auto-profile pass:
> > > The patch adds a new field timestamp to function_instance, and
> > > populates it in read_function_instance.
> > >
> > > It maintains a new timestamp_info_map from timestamp -> <name,
> > > tp_first_run>, which maps timestamps sorted in ascending order to
> > > (1..N), so lowest ordered timestamp is mapped to 1 and so on. The
> > > rationale for this is that timestamps are 64-bit integers, and we
> > > don't need the full 64-bit range for ordering by tp_first_run.
> > >
> > > During annotation, the timestamp associated with function_instance
> > is
> > > looked up in timestamp_info_map, and corresponding mapped value is
> > > assigned to node->tp_first_run.
> > >
> > > (2) Handling clones:
> > > Currently, for clones not registered in call graph before auto-
> > profile
> > > pass, the tp_first_run field is copied from original function,
> when
> > > the clone is created.
> > > However that may not correspond to the actual order of functions.
> > >
> > > For example, if we have two profiled clones of foo:
> > > foo.constprop.1, foo.constprop.2
> > >
> > > both will get same value for tp_first_run as foo->tp_first_run,
> > which
> > > might not correspond to time profile order.
> > >
> > > To address this, the patch introduces a new IPA pass
> > > ipa_adjust_tp_first_run, that streams <clone name, tp_first_run>
> > from
> > > timestamp_info_map during LGEN, and during WPA reads it, and sets
> > > clone's tp_first_run field accordingly.
> > > The pass is placed pretty late (just before locality_cloning); by
> > > that point clones would be registered in the call graph.
> > >
> > > Dhruv's sourcefile tracking patch already handles LTO privatized
> > > functions.
> > > The patch adds a (temporary) workaround for functions with
> > > mismatched/empty filenames from gcov, so that they are not dropped
> > > in afdo_annotate_cfg: if get_function_instance_by_decl fails to
> > > find a function_instance with lbasename (DECL_SOURCE_FILE (decl)),
> > > it iterates through all filenames in afdo_string_table.
> > >
> > > (3) Grouping profiled functions together in as few partitions as
> > > possible (preferably single).
> > > The patch places profiled functions in time profile order together
> > > in as few partitions as possible to take better advantage of code
> > > locality.
> > > Unlike PGO, where every instrumented function gets a time profile
> > > counter, with AutoFDO, the sampled functions are a fraction of the
> > > total executed ones.
> > > Similarly, in default_function_section, it overrides hot/cold
> > > partitioning so that grouping of profiled functions isn't
> disrupted.
> > >
> > > (4) Option to disable profile driven opts.
> > > The patch adds option -fauto-profile-reorder-only which only
> enables
> > > time-profile reordering with AutoFDO (and disables profile driven
> > > opts):
> > > (a) Useful as a debugging aid to isolate regression to either
> > function
> > > reordering or profile driven opts.
> > > (b) For our use case, it's also seemingly useful as a stopgap
> > measure
> > > to avoid regressions with AutoFDO profile driven opts, due to
> issues
> > > with profile quality obtained with merging of SPE and non SPE
> > > profiles.
> > > We're actively working on resolving this.
> > > (c) Possibly useful for architectures which do not support branch
> > > sampling.
> > > The option is disabled by default.
> > >
> > > Ideally, I would like to make it a param (and not a user-facing
> > > option), but I am not able to control enabling/disabling options
> > > in opts.cc:common_handle_option based on a param value; I will
> > > investigate this further.
> > >
> > > * Results
> > >
> > > On one large internal workload, the patch (along with the
> > > sourcefile tracking patch) gives an uplift of 32.63% compared to
> > > LTO and 8.07% compared to LTO + AutoFDO trunk. On another workload
> > > it gives an uplift of 15.31% compared to LTO and 7.76% compared to
> > > LTO + AutoFDO trunk.
> > > I will try benchmarking with SPEC2017.
> > >
> > > Will be grateful for suggestions on how to proceed further.
> > Hi,
> > ping: https://gcc.gnu.org/pipermail/gcc-patches/2025-October/696758.html
> Hi,
> ping * 2: https://gcc.gnu.org/pipermail/gcc-patches/2025-October/696758.html
Hi,
ping * 3: https://gcc.gnu.org/pipermail/gcc-patches/2025-October/696758.html
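
For anyone skimming the thread, the timestamp -> tp_first_run mapping
from point (1) of the RFC quoted below can be sketched roughly as
follows. This is a standalone illustration and not code from the patch:
rank_timestamps and timestamp_info_map_t are hypothetical names, the
input is a plain vector instead of function_instance objects, and
treating a zero timestamp as "not sampled" is an assumption made only
for this sketch.

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for timestamp_info_map:
// timestamp -> (function name, tp_first_run rank).
using timestamp_info_map_t
  = std::map<uint64_t, std::pair<std::string, int>>;

// Map 64-bit first-execution timestamps to ranks 1..N in ascending
// order, so the lowest timestamp gets rank 1.
timestamp_info_map_t
rank_timestamps (const std::vector<std::pair<std::string, uint64_t>> &funcs)
{
  timestamp_info_map_t map;
  for (const auto &f : funcs)
    // Assumption for this sketch: 0 means the function has no timestamp.
    // If two functions share a timestamp, only the first is kept.
    if (f.second != 0)
      map.emplace (f.second, std::make_pair (f.first, 0));

  // std::map iterates in ascending key order, so one pass assigns ranks.
  int rank = 0;
  for (auto &entry : map)
    entry.second.second = ++rank;
  return map;
}
```

During annotation, the rank found for a function's timestamp would then
be what gets assigned to node->tp_first_run; the full 64-bit value is
only needed to establish the order.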
Thanks,
Prathamesh
>
> Thanks,
> Prathamesh
> >
> > Thanks,
> > Prathamesh
> > >
> > > Signed-off-by: Prathamesh Kulkarni <[email protected]>
> > >
> > > Thanks,
> > > Prathamesh