Re: [PATCH][RFC] Add new ipa-reorder pass
Martin Liška writes: > > Notes and limitations: > - The call-chain-clustering algorithm requires to fit as many as possible > functions into page size (4K). > Based on my measurements that should correspond to ~1000 GIMPLE statements > (IPA inliner size). I can > make it a param in the future. That sounds inaccurate. I would assume e.g. integer code has a quite different factor than floating point code. Perhaps it would need something fancier. If you use a static factor you probably would need to calibrate it over a lot more code. Anyways, I think it should be a param, or better even an option, because there's a trend towards using 2MB code pages on x86. Linux has a number of ways to do that today. Also of course that's needed for other architectures. -Andi
Re: [PATCH][RFC] Add new ipa-reorder pass
> * Makefile.in: Add ipa-reorder.o. > * cgraph.c (cgraph_node::dump): Dump > text_sorted_order. > (cgraph_node_cmp_by_text_sorted): > New function that sorts functions based > on text_sorted_order. > * cgraph.h (cgraph_node): Add text_sorted_order. > (cgraph_node_cmp_by_text_sorted): New. > * cgraphclones.c (cgraph_node::create_clone): > Clone also text_sorted_order. > * cgraphunit.c (node_cmp): Remove. > (expand_all_functions): Use new function > cgraph_node_cmp_by_text_sorted. > * common.opt: Add new option reorder_functions_algorithm. > * flag-types.h (enum reorder_functions_algorithm): > New enum. > * ipa-reorder.c: New file. > * lto-cgraph.c (lto_output_node): Stream in and out > text_sorted_order. > (input_node): Likewise. > * passes.def: Add pass_ipa_reorder. > * timevar.def (TV_IPA_REORDER): New. > * tree-pass.h (make_pass_ipa_reorder): New. > * varasm.c (default_function_section): Assign text.sorted.X > section. > > gcc/lto/ChangeLog: > > 2019-11-25 Martin Liska > > * lto-partition.c (node_cmp): Remove. > (lto_balanced_map): Use new cgraph_node_cmp_by_text_sorted. > +/* Sort cgraph_nodes by text_sorted_order if available, or by order. */ > + > +int > +cgraph_node_cmp_by_text_sorted (const void *pa, const void *pb) > +{ > + const cgraph_node *a = *(const cgraph_node * const *) pa; > + const cgraph_node *b = *(const cgraph_node * const *) pb; > + > + /* Functions with text_sorted_order should be before these > + without profile. */ > + if (a->text_sorted_order == 0 || b->text_sorted_order == 0) > +return a->text_sorted_order - b->text_sorted_order; > + > + return a->text_sorted_order != b->text_sorted_order > + ? b->text_sorted_order - a->text_sorted_order > + : b->order - a->order; > +} We now have tp_first_run_node_cmp with needs to be updated and renamed isntead of adding new comparator. >/* Time profiler: first run of function. */ >int tp_first_run; > + /* Order in .text.sorted.* section. */ > + int text_sorted_order; tp_first_run probalby should go into summary since it will live from profling to the ipa-reorder pass where it will be copied to text_sorted_order? Or how these two interface? > diff --git a/gcc/cgraphclones.c b/gcc/cgraphclones.c > index a79491e0b88..f624cbee185 100644 > --- a/gcc/cgraphclones.c > +++ b/gcc/cgraphclones.c > @@ -372,6 +372,7 @@ cgraph_node::create_clone (tree new_decl, profile_count > prof_count, >new_node->rtl = rtl; >new_node->frequency = frequency; >new_node->tp_first_run = tp_first_run; > + new_node->text_sorted_order = text_sorted_order; Copy it also in duplicate_thunk_for_node and create_version_node. I think we should refactor the copying code and share it in all places we duplicate function. > +freorder-functions-algorithm= > +Common Joined RejectNegative Enum(reorder_functions_algorithm) > Var(flag_reorder_functions_algorithm) > Init(REORDER_FUNCTIONS_ALGORITHM_CALL_CHAIN_CLUSTERING) Optimization > +-freorder-functions-algorithm=[first-run|call-chain-clustering] Set the > used function reordering algorithm. > + > +Enum > +Name(reorder_functions_algorithm) Type(enum reorder_functions_algorithm) > UnknownError(unknown function reordering algorithm %qs) > + > +EnumValue > +Enum(reorder_functions_algorithm) String(first-run) > Value(REORDER_FUNCTIONS_ALGORITHM_FIRST_RUN) > + > +EnumValue > +Enum(reorder_functions_algorithm) String(call-chain-clustering) > Value(REORDER_FUNCTIONS_ALGORITHM_CALL_CHAIN_CLUSTERING) I think we should also have "definition-order" which will sort by node->order. For testing we may also want to have "random" which will do something stupid like sorting by symbol name checksum so we could see how these compares. > +/* The algorithm used for function reordering. */ > +enum reorder_functions_algorithm > +{ > + REORDER_FUNCTIONS_ALGORITHM_FIRST_RUN, > + REORDER_FUNCTIONS_ALGORITHM_CALL_CHAIN_CLUSTERING > +}; Isn't this supposed to be generated by awk scripts? > + > +#define C3_CLUSTER_THRESHOLD 1024 description + perhaps PARAM? > + > +struct cluster_edge; > + > +/* Cluster is set of functions considered in C3 algorithm. */ Perhaps an high level overview of C3 algorithm with a reference would be nice. > +/* Sort functions based of first execution of the function. */ > + > +static void > +sort_functions_by_first_run (void) > +{ > + cgraph_node *node; > + > + FOR_EACH_DEFINED_FUNCTION (node) > +if (node->tp_first_run && !node->alias) > + node->text_sorted_order = node->tp_first_run; Why you care about excluding aliases? We also have thunks where tp_first_run is ignored. I am just curious if you got some ICE here. > +} > + > +/* Compare clusters by density after that are established. */ > + > +static int > +cluster_cmp (const void *a_p, const void *b_p) > +{ > + const cluster *a = *(cluster
Re: [PATCH][RFC] Add new ipa-reorder pass
Hi. I've just updated the script a bit and I added also address histogram: https://drive.google.com/file/d/11s9R_JnEMohDE6ctqzsj092QD22HKXJI/view?usp=sharing Martin
Re: [PATCH][RFC] Add new ipa-reorder pass
> > On the first glance the difference between gcc9 and gcc10 is explained > > by the changes to profile updating. gcc9 makes very small cold > > partitions compared to gcc10. It is very nice that we have a way to > > measure it. I will also check if some of the more important profiling > > update fixes makes sense to backport to gcc9. > > > > Over weekend I did some fixes to tp reordreing, so it may be nice to > > update your tests, but I will try to run it myself. > > > > In general one can see individual stages of compilation on the graph - > > parsing, early lowering, early opts. On bigger programs this should be > > more visible. I will give it a try. > > You haven't replied to question whether we want to let ipa-reorder into > trunk based on the sent images for GCC 10 PGO+LTO boostrap? My concern is still the same - while I like the patch I am worried that we have only one example where it produces some benefit. For this reason I looked into tp_first_run issues this weekend and found fixed some issues / verified that the order now seems to be fine for cc1 binary and partly for Firefox. We do run periodic benchmarks to keep optimization in shape, but code layout stuff is out in wild and no one is verifying that it works. It is now bit nontrivial piece of code and thus it is not big suprise that there are numeber of bugs. I like the heatmap generator (but for some reason it generates empty pngs for me. Maybe things are just out of range since bounds are hardcoded in your script) I don't have much time today but tomorrow I will try to get to it tested and send some comments on the patch. Honza
Re: [PATCH][RFC] Add new ipa-reorder pass
On 12/9/19 2:03 PM, Jan Hubicka wrote: On 12/9/19 1:14 PM, Martin Liška wrote: Hello. Based on presentation that had Sriraman Tallam at a LLVM conference: https://www.youtube.com/watch?v=DySuXFGmB40 I made a heatmap based on executed instruction addresses. I used $ perf record -F max -- ./cc1plus -fpreprocessed /home/marxin/Programming/tramp3d/tramp3d-v4.ii and $ perf script -F time,ip,dso I'm sending link for my system GCC 9 (PGO+lean LTO bootstrap), GCC 10 before and after my reorder patch (also PGO+lean LTO bootstrap). One can see quite significant clustering starting from 5s till the end of compilation. Link: https://drive.google.com/open?id=1M0YlxvQPyiVguy5VWRC8dG52UArwAuKS Martin For the completeness, the heatmap was generated with the following script: https://github.com/marxin/script-misc/blob/master/binary-heatmap.py Thanks, this looks really useful as we had almost no way to check code layout ever since you systemtap script stopped working. Great, thanks. On the first glance the difference between gcc9 and gcc10 is explained by the changes to profile updating. gcc9 makes very small cold partitions compared to gcc10. It is very nice that we have a way to measure it. I will also check if some of the more important profiling update fixes makes sense to backport to gcc9. Over weekend I did some fixes to tp reordreing, so it may be nice to update your tests, but I will try to run it myself. In general one can see individual stages of compilation on the graph - parsing, early lowering, early opts. On bigger programs this should be more visible. I will give it a try. You haven't replied to question whether we want to let ipa-reorder into trunk based on the sent images for GCC 10 PGO+LTO boostrap? Martin Honza
Re: [PATCH][RFC] Add new ipa-reorder pass
> On 12/9/19 1:14 PM, Martin Liška wrote: > > Hello. > > > > Based on presentation that had Sriraman Tallam at a LLVM conference: > > https://www.youtube.com/watch?v=DySuXFGmB40 > > > > I made a heatmap based on executed instruction addresses. I used > > $ perf record -F max -- ./cc1plus -fpreprocessed > > /home/marxin/Programming/tramp3d/tramp3d-v4.ii > > and > > $ perf script -F time,ip,dso > > > > I'm sending link for my system GCC 9 (PGO+lean LTO bootstrap), GCC 10 > > before and after my reorder > > patch (also PGO+lean LTO bootstrap). > > > > One can see quite significant clustering starting from 5s till the end of > > compilation. > > Link: https://drive.google.com/open?id=1M0YlxvQPyiVguy5VWRC8dG52UArwAuKS > > > > Martin > > For the completeness, the heatmap was generated with the following script: > https://github.com/marxin/script-misc/blob/master/binary-heatmap.py Thanks, this looks really useful as we had almost no way to check code layout ever since you systemtap script stopped working. On the first glance the difference between gcc9 and gcc10 is explained by the changes to profile updating. gcc9 makes very small cold partitions compared to gcc10. It is very nice that we have a way to measure it. I will also check if some of the more important profiling update fixes makes sense to backport to gcc9. Over weekend I did some fixes to tp reordreing, so it may be nice to update your tests, but I will try to run it myself. In general one can see individual stages of compilation on the graph - parsing, early lowering, early opts. On bigger programs this should be more visible. I will give it a try. Honza
Re: [PATCH][RFC] Add new ipa-reorder pass
On 12/9/19 1:14 PM, Martin Liška wrote: Hello. Based on presentation that had Sriraman Tallam at a LLVM conference: https://www.youtube.com/watch?v=DySuXFGmB40 I made a heatmap based on executed instruction addresses. I used $ perf record -F max -- ./cc1plus -fpreprocessed /home/marxin/Programming/tramp3d/tramp3d-v4.ii and $ perf script -F time,ip,dso I'm sending link for my system GCC 9 (PGO+lean LTO bootstrap), GCC 10 before and after my reorder patch (also PGO+lean LTO bootstrap). One can see quite significant clustering starting from 5s till the end of compilation. Link: https://drive.google.com/open?id=1M0YlxvQPyiVguy5VWRC8dG52UArwAuKS Martin For the completeness, the heatmap was generated with the following script: https://github.com/marxin/script-misc/blob/master/binary-heatmap.py Martin
Re: [PATCH][RFC] Add new ipa-reorder pass
Hello. Based on presentation that had Sriraman Tallam at a LLVM conference: https://www.youtube.com/watch?v=DySuXFGmB40 I made a heatmap based on executed instruction addresses. I used $ perf record -F max -- ./cc1plus -fpreprocessed /home/marxin/Programming/tramp3d/tramp3d-v4.ii and $ perf script -F time,ip,dso I'm sending link for my system GCC 9 (PGO+lean LTO bootstrap), GCC 10 before and after my reorder patch (also PGO+lean LTO bootstrap). One can see quite significant clustering starting from 5s till the end of compilation. Link: https://drive.google.com/open?id=1M0YlxvQPyiVguy5VWRC8dG52UArwAuKS Martin
Re: [PATCH][RFC] Add new ipa-reorder pass
On 12/1/19 11:37 PM, Jan Hubicka wrote: Hi, I was playing with it a bit more and built with -fno-profile-reorder-functions. Here is -fno-profile-reorder-functions compared to first run https://treeherder.mozilla.org/perf.html#/compare?originalProject=try=3d537be0cb37458e7928f69a37efb2a6d6b85eae=try=4543abfa08870391544b56d16dfcd530dac0dc30=1 2.2% improvement on the page rendering is off noise but I would hope for bit more. Here is -fno-profile-reorder-functions compared to clustering https://treeherder.mozilla.org/perf.html#/compare?originalProject=try=3d537be0cb37458e7928f69a37efb2a6d6b85eae=try=1c2d53b10b042c15edbe7bd26e2740641840=1 I am not sure if it is a noise. Either Fireox's talos is not good benchmark for code layout (which I would be surprised since it is quite sensitive to code size issues) or there are some problems. In general I think the patch is useful and mostly mainline ready except for detailes but it would be good to have some more evidence that it works as expected on large binaries besides tramp3d build time. There are number of ways where things may go wrong ranging from misupdated profiles, problems with function splitting, comdats and other issues. Based on my testing, I was able to see cc1plus binary really sorted as seen by the pass, including various IPA clones that inherited order from their origins. Was you able to benchmark some other benefits? Unfortunately not. I remember we discussed collecting traces from valgrind, perhaps we could test that they are looking good? I have some semi-working icegrind port here: https://github.com/marxin/valgrind/tree/icegrind It will probably need some extra work and it's terribly slow. But you can try it. I would first wait for the Firefox dump files and then we can discuss what to do with the pass. Martin Honza
Re: [PATCH][RFC] Add new ipa-reorder pass
On 12/1/19 11:45 AM, Jan Hubicka wrote: Hi. I'm sending v3 of the patch where I changed: - function.cold sections are properly put into .text.unlikely and not into a .text.sorted.XYZ section I've just finished measurements and I still have the original speed up for tramp3d: Total runs: 10, before: 13.92, after: 13.82, cmp: 99.219% Hi, I have updated binutils to current head on the Firefox testing patch and run FDO build with tp first run ordering and call chain clustering. https://treeherder.mozilla.org/perf.html#/compare?originalProject=try=1313e6a4d74ebff702afa7594684beb83c01d95f=try=1c2d53b10b042c15edbe7bd26e2740641840=1 Hello. Thank you for the testing. It seems there are no differences in performance. The two binaries can be downloaded at w/o patch: https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/MK-7DC3FQcevZC_Nvlnq8Q/runs/0/artifacts/public/build/target.tar.bz2 with call chain clustering. https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/UVh6iNILT-qb8sYM5vxVCQ/runs/0/artifacts/public/build/target.tar.bz2 I would ideally need output of -fdump-ipa-reorder which prints sorted symbols and so that I can compare it with resulting assembly. Since Firefox is quite sensitive to code size I would expect to be able to measure some benefits here. Any idea what may have go wrong? That's a pity. I checked that the binaries seems generally sane - out of 58MB text segment there is 34MB cold section. It is possible that system ld is used instead of provided one, but that would be weird. I will try to find way to double-check that updating binutils really updated them for GCC. I can double check once having the dump file. Martin Honza
Re: [PATCH][RFC] Add new ipa-reorder pass
On 12/1/19 12:11 PM, Jan Hubicka wrote: Hi. I'm sending v3 of the patch where I changed: - function.cold sections are properly put into .text.unlikely and not into a .text.sorted.XYZ section I've just finished measurements and I still have the original speed up for tramp3d: Total runs: 10, before: 13.92, after: 13.82, cmp: 99.219% Hi, I have updated binutils to current head on the Firefox testing patch and run FDO build with tp first run ordering and call chain clustering. https://treeherder.mozilla.org/perf.html#/compare?originalProject=try=1313e6a4d74ebff702afa7594684beb83c01d95f=try=1c2d53b10b042c15edbe7bd26e2740641840=1 It seems there are no differences in performance. The two binaries can be downloaded at w/o patch: https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/MK-7DC3FQcevZC_Nvlnq8Q/runs/0/artifacts/public/build/target.tar.bz2 with call chain clustering. https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/UVh6iNILT-qb8sYM5vxVCQ/runs/0/artifacts/public/build/target.tar.bz2 Since Firefox is quite sensitive to code size I would expect to be able to measure some benefits here. Any idea what may have go wrong? I checked that the binaries seems generally sane - out of 58MB text segment there is 34MB cold section. It is possible that system ld is used instead of provided one, but that would be weird. I will try to find way to double-check that updating binutils really updated them for GCC. Linker used is GNU ld (GNU Binutils) 2.33.50.20191130 Hello. You can check linker support with: $ ~/bin/binutils/bin/ld.bfd --verbose | grep text.sorted *(SORT(.text.sorted.*)) Is Firefox using BFD or GOLD for linking? Does it have all necessary support in? Do I need something like -ffunction-sections? No, you don't need it. Martin Honza
Re: [PATCH][RFC] Add new ipa-reorder pass
Hi, I was playing with it a bit more and built with -fno-profile-reorder-functions. Here is -fno-profile-reorder-functions compared to first run https://treeherder.mozilla.org/perf.html#/compare?originalProject=try=3d537be0cb37458e7928f69a37efb2a6d6b85eae=try=4543abfa08870391544b56d16dfcd530dac0dc30=1 2.2% improvement on the page rendering is off noise but I would hope for bit more. Here is -fno-profile-reorder-functions compared to clustering https://treeherder.mozilla.org/perf.html#/compare?originalProject=try=3d537be0cb37458e7928f69a37efb2a6d6b85eae=try=1c2d53b10b042c15edbe7bd26e2740641840=1 I am not sure if it is a noise. Either Fireox's talos is not good benchmark for code layout (which I would be surprised since it is quite sensitive to code size issues) or there are some problems. In general I think the patch is useful and mostly mainline ready except for detailes but it would be good to have some more evidence that it works as expected on large binaries besides tramp3d build time. There are number of ways where things may go wrong ranging from misupdated profiles, problems with function splitting, comdats and other issues. Was you able to benchmark some other benefits? I remember we discussed collecting traces from valgrind, perhaps we could test that they are looking good? Honza
Re: [PATCH][RFC] Add new ipa-reorder pass
> > Hi. > > > > I'm sending v3 of the patch where I changed: > > - function.cold sections are properly put into .text.unlikely and > > not into a .text.sorted.XYZ section > > > > I've just finished measurements and I still have the original speed up > > for tramp3d: > > Total runs: 10, before: 13.92, after: 13.82, cmp: 99.219% > > Hi, > I have updated binutils to current head on the Firefox testing patch and > run FDO build with tp first run ordering and call chain clustering. > https://treeherder.mozilla.org/perf.html#/compare?originalProject=try=1313e6a4d74ebff702afa7594684beb83c01d95f=try=1c2d53b10b042c15edbe7bd26e2740641840=1 > > It seems there are no differences in performance. The two binaries can > be downloaded at > > w/o patch: > https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/MK-7DC3FQcevZC_Nvlnq8Q/runs/0/artifacts/public/build/target.tar.bz2 > with call chain clustering. > https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/UVh6iNILT-qb8sYM5vxVCQ/runs/0/artifacts/public/build/target.tar.bz2 > > Since Firefox is quite sensitive to code size I would expect to be able > to measure some benefits here. Any idea what may have go wrong? > I checked that the binaries seems generally sane - out of 58MB text > segment there is 34MB cold section. It is possible that system ld is > used instead of provided one, but that would be weird. I will try to > find way to double-check that updating binutils really updated them for > GCC. Linker used is GNU ld (GNU Binutils) 2.33.50.20191130 Does it have all necessary support in? Do I need something like -ffunction-sections? Honza
Re: [PATCH][RFC] Add new ipa-reorder pass
> Hi. > > I'm sending v3 of the patch where I changed: > - function.cold sections are properly put into .text.unlikely and > not into a .text.sorted.XYZ section > > I've just finished measurements and I still have the original speed up > for tramp3d: > Total runs: 10, before: 13.92, after: 13.82, cmp: 99.219% Hi, I have updated binutils to current head on the Firefox testing patch and run FDO build with tp first run ordering and call chain clustering. https://treeherder.mozilla.org/perf.html#/compare?originalProject=try=1313e6a4d74ebff702afa7594684beb83c01d95f=try=1c2d53b10b042c15edbe7bd26e2740641840=1 It seems there are no differences in performance. The two binaries can be downloaded at w/o patch: https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/MK-7DC3FQcevZC_Nvlnq8Q/runs/0/artifacts/public/build/target.tar.bz2 with call chain clustering. https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/UVh6iNILT-qb8sYM5vxVCQ/runs/0/artifacts/public/build/target.tar.bz2 Since Firefox is quite sensitive to code size I would expect to be able to measure some benefits here. Any idea what may have go wrong? I checked that the binaries seems generally sane - out of 58MB text segment there is 34MB cold section. It is possible that system ld is used instead of provided one, but that would be weird. I will try to find way to double-check that updating binutils really updated them for GCC. Honza
Re: [PATCH][RFC] Add new ipa-reorder pass
Hi. I'm sending v3 of the patch where I changed: - function.cold sections are properly put into .text.unlikely and not into a .text.sorted.XYZ section I've just finished measurements and I still have the original speed up for tramp3d: Total runs: 10, before: 13.92, after: 13.82, cmp: 99.219% Thoughts? Martin >From f3fd85c29a4a2746555294bd30e3c31129030074 Mon Sep 17 00:00:00 2001 From: Martin Liska Date: Thu, 5 Sep 2019 13:32:41 +0200 Subject: [PATCH] Add new ipa-reorder pass. gcc/ChangeLog: 2019-11-25 Martin Liska * Makefile.in: Add ipa-reorder.o. * cgraph.c (cgraph_node::dump): Dump text_sorted_order. (cgraph_node_cmp_by_text_sorted): New function that sorts functions based on text_sorted_order. * cgraph.h (cgraph_node): Add text_sorted_order. (cgraph_node_cmp_by_text_sorted): New. * cgraphclones.c (cgraph_node::create_clone): Clone also text_sorted_order. * cgraphunit.c (node_cmp): Remove. (expand_all_functions): Use new function cgraph_node_cmp_by_text_sorted. * common.opt: Add new option reorder_functions_algorithm. * flag-types.h (enum reorder_functions_algorithm): New enum. * ipa-reorder.c: New file. * lto-cgraph.c (lto_output_node): Stream in and out text_sorted_order. (input_node): Likewise. * passes.def: Add pass_ipa_reorder. * timevar.def (TV_IPA_REORDER): New. * tree-pass.h (make_pass_ipa_reorder): New. * varasm.c (default_function_section): Assign text.sorted.X section. gcc/lto/ChangeLog: 2019-11-25 Martin Liska * lto-partition.c (node_cmp): Remove. (lto_balanced_map): Use new cgraph_node_cmp_by_text_sorted. --- gcc/Makefile.in | 1 + gcc/cgraph.c| 20 +++ gcc/cgraph.h| 3 + gcc/cgraphclones.c | 1 + gcc/cgraphunit.c| 21 +-- gcc/common.opt | 14 ++ gcc/flag-types.h| 8 + gcc/ipa-reorder.c | 383 gcc/lto-cgraph.c| 2 + gcc/lto/lto-partition.c | 33 +--- gcc/passes.def | 1 + gcc/timevar.def | 1 + gcc/tree-pass.h | 1 + gcc/varasm.c| 9 + 14 files changed, 448 insertions(+), 50 deletions(-) create mode 100644 gcc/ipa-reorder.c diff --git a/gcc/Makefile.in b/gcc/Makefile.in index 7d3c13230e4..163d47c47f1 100644 --- a/gcc/Makefile.in +++ b/gcc/Makefile.in @@ -1369,6 +1369,7 @@ OBJS = \ init-regs.o \ internal-fn.o \ ipa-cp.o \ + ipa-reorder.o \ ipa-sra.o \ ipa-devirt.o \ ipa-fnsummary.o \ diff --git a/gcc/cgraph.c b/gcc/cgraph.c index 180d21e4796..9c09d63e4ee 100644 --- a/gcc/cgraph.c +++ b/gcc/cgraph.c @@ -1951,6 +1951,8 @@ cgraph_node::dump (FILE *f) } if (tp_first_run > 0) fprintf (f, " first_run:%i", tp_first_run); + if (text_sorted_order > 0) +fprintf (f, " text_sorted_order:%i", text_sorted_order); if (origin) fprintf (f, " nested in:%s", origin->asm_name ()); if (gimple_has_body_p (decl)) @@ -3718,6 +3720,24 @@ cgraph_edge::possibly_call_in_translation_unit_p (void) return node->get_availability () >= AVAIL_INTERPOSABLE; } +/* Sort cgraph_nodes by text_sorted_order if available, or by order. */ + +int +cgraph_node_cmp_by_text_sorted (const void *pa, const void *pb) +{ + const cgraph_node *a = *(const cgraph_node * const *) pa; + const cgraph_node *b = *(const cgraph_node * const *) pb; + + /* Functions with text_sorted_order should be before these + without profile. */ + if (a->text_sorted_order == 0 || b->text_sorted_order == 0) +return a->text_sorted_order - b->text_sorted_order; + + return a->text_sorted_order != b->text_sorted_order + ? b->text_sorted_order - a->text_sorted_order + : b->order - a->order; +} + /* A stashed copy of "symtab" for use by selftest::symbol_table_test. This needs to be a global so that it can be a GC root, and thus prevent the stashed copy from being garbage-collected if the GC runs diff --git a/gcc/cgraph.h b/gcc/cgraph.h index 0d2442c997c..039819db358 100644 --- a/gcc/cgraph.h +++ b/gcc/cgraph.h @@ -1435,6 +1435,8 @@ struct GTY((tag ("SYMTAB_FUNCTION"))) cgraph_node : public symtab_node unsigned int profile_id; /* Time profiler: first run of function. */ int tp_first_run; + /* Order in .text.sorted.* section. */ + int text_sorted_order; /* Set when decl is an abstract function pointed to by the ABSTRACT_DECL_ORIGIN of a reachable function. */ @@ -2451,6 +2453,7 @@ bool cgraph_function_possibly_inlined_p (tree); const char* cgraph_inline_failed_string (cgraph_inline_failed_t); cgraph_inline_failed_type_t cgraph_inline_failed_type (cgraph_inline_failed_t); +int cgraph_node_cmp_by_text_sorted (const void *pa, const void *pb); /* In cgraphunit.c */ void cgraphunit_c_finalize (void); diff --git a/gcc/cgraphclones.c b/gcc/cgraphclones.c index a79491e0b88..f624cbee185 100644 --- a/gcc/cgraphclones.c +++ b/gcc/cgraphclones.c @@ -372,6 +372,7 @@ cgraph_node::create_clone (tree new_decl, profile_count prof_count, new_node->rtl = rtl;
Re: [PATCH][RFC] Add new ipa-reorder pass
Hello. I'm sending v2 of the patch set based on the discussion I had with Honza. Changes from previous version: - I changed type of edge count from uint32_t to uint64_t. - The algorithm traverses recursively inline clones. - TDF_DUMP_DETAILS is supported and provides more information. - I added cgraph_node_cmp_by_text_sorted function that is used for symbol sorting both in cgraphunit.c and lto-cgraph.c. I made the same testing as before I get following numbers of tramp3d w/ -O2: Total runs: 15, before: 13.89, after: 13.81, cmp: 99.383% and iTLB-load-misses:u goes from 15,450,298 to 11,633,215. The corresponding binutils patch part was partially accepted (for BFD) and today I was given a feedback for ld.gold. Thoughts? Thanks, Martin >From 9d62da95dc8efd4194565fca6471b3ddb6109c9b Mon Sep 17 00:00:00 2001 From: Martin Liska Date: Thu, 5 Sep 2019 13:32:41 +0200 Subject: [PATCH] Add new ipa-reorder pass. gcc/ChangeLog: 2019-11-25 Martin Liska * Makefile.in: Add ipa-reorder.o. * cgraph.c (cgraph_node::dump): Dump text_sorted_order. (cgraph_node_cmp_by_text_sorted): New function that sorts functions based on text_sorted_order. * cgraph.h (cgraph_node): Add text_sorted_order. (cgraph_node_cmp_by_text_sorted): New. * cgraphclones.c (cgraph_node::create_clone): Clone also text_sorted_order. * cgraphunit.c (node_cmp): Remove. (expand_all_functions): Use new function cgraph_node_cmp_by_text_sorted. * common.opt: Add new option reorder_functions_algorithm. * flag-types.h (enum reorder_functions_algorithm): New enum. * ipa-reorder.c: New file. * lto-cgraph.c (lto_output_node): Stream in and out text_sorted_order. (input_node): Likewise. * passes.def: Add pass_ipa_reorder. * timevar.def (TV_IPA_REORDER): New. * tree-pass.h (make_pass_ipa_reorder): New. * varasm.c (default_function_section): Assign text.sorted.X section. gcc/lto/ChangeLog: 2019-11-25 Martin Liska * lto-partition.c (node_cmp): Remove. (lto_balanced_map): Use new cgraph_node_cmp_by_text_sorted. --- gcc/Makefile.in | 1 + gcc/cgraph.c| 20 +++ gcc/cgraph.h| 3 + gcc/cgraphclones.c | 1 + gcc/cgraphunit.c| 21 +-- gcc/common.opt | 14 ++ gcc/flag-types.h| 8 + gcc/ipa-reorder.c | 383 gcc/lto-cgraph.c| 2 + gcc/lto/lto-partition.c | 33 +--- gcc/passes.def | 1 + gcc/timevar.def | 1 + gcc/tree-pass.h | 1 + gcc/varasm.c| 9 + 14 files changed, 448 insertions(+), 50 deletions(-) create mode 100644 gcc/ipa-reorder.c diff --git a/gcc/Makefile.in b/gcc/Makefile.in index 7d3c13230e4..163d47c47f1 100644 --- a/gcc/Makefile.in +++ b/gcc/Makefile.in @@ -1369,6 +1369,7 @@ OBJS = \ init-regs.o \ internal-fn.o \ ipa-cp.o \ + ipa-reorder.o \ ipa-sra.o \ ipa-devirt.o \ ipa-fnsummary.o \ diff --git a/gcc/cgraph.c b/gcc/cgraph.c index 1f7a5c58d98..0e31582d2ce 100644 --- a/gcc/cgraph.c +++ b/gcc/cgraph.c @@ -1951,6 +1951,8 @@ cgraph_node::dump (FILE *f) } if (tp_first_run > 0) fprintf (f, " first_run:%i", tp_first_run); + if (text_sorted_order > 0) +fprintf (f, " text_sorted_order:%i", text_sorted_order); if (origin) fprintf (f, " nested in:%s", origin->asm_name ()); if (gimple_has_body_p (decl)) @@ -3696,6 +3698,24 @@ cgraph_edge::possibly_call_in_translation_unit_p (void) return node->get_availability () >= AVAIL_INTERPOSABLE; } +/* Sort cgraph_nodes by text_sorted_order if available, or by order. */ + +int +cgraph_node_cmp_by_text_sorted (const void *pa, const void *pb) +{ + const cgraph_node *a = *(const cgraph_node * const *) pa; + const cgraph_node *b = *(const cgraph_node * const *) pb; + + /* Functions with text_sorted_order should be before these + without profile. */ + if (a->text_sorted_order == 0 || b->text_sorted_order == 0) +return a->text_sorted_order - b->text_sorted_order; + + return a->text_sorted_order != b->text_sorted_order + ? b->text_sorted_order - a->text_sorted_order + : b->order - a->order; +} + /* A stashed copy of "symtab" for use by selftest::symbol_table_test. This needs to be a global so that it can be a GC root, and thus prevent the stashed copy from being garbage-collected if the GC runs diff --git a/gcc/cgraph.h b/gcc/cgraph.h index a4f14743f00..ad6ad7ef513 100644 --- a/gcc/cgraph.h +++ b/gcc/cgraph.h @@ -1431,6 +1431,8 @@ struct GTY((tag ("SYMTAB_FUNCTION"))) cgraph_node : public symtab_node unsigned int profile_id; /* Time profiler: first run of function. */ int tp_first_run; + /* Order in .text.sorted.* section. */ + int text_sorted_order; /* Set when decl is an abstract function pointed to by the ABSTRACT_DECL_ORIGIN of a reachable function. */ @@ -2447,6 +2449,7 @@ bool cgraph_function_possibly_inlined_p (tree); const char* cgraph_inline_failed_string (cgraph_inline_failed_t); cgraph_inline_failed_type_t
Re: [PATCH][RFC] Add new ipa-reorder pass
> > This is why we currently have way to order function when outputting them > > and use that with FDO (using Martin's first execution logic). This has > > drwarback of making the functions to flow in that order through late > > optimizations and RTL backend and thus we lose IPA-RA and some > > IP propagation (late pure/const/nothrow discovery). > > But you can also fix that by the parallelization GSoC project approach, > decoupling things at RTL expansion IPA-wise and using output order > for the latter (or even do the fragments in GCC itself by refactoring the > output machinery to use function-local "strings", assembling the final > file later). Refactoring output to handle multiple output "files" at the > same time might also help getting rid of that early-LTO-debug copying > stuff (now that it seems that linker-plugin extensions for partially claiming > files will never happen...) Multiple output files won't solve my concern about padding and relaxation, but yep, parallelizing RTL will definitly need to introduce way to hold final assembly and glue it together, so this problem may get solved independently which would be nice. Honza
Re: [PATCH][RFC] Add new ipa-reorder pass
On Sun, Oct 6, 2019 at 4:38 PM Jan Hubicka wrote: > > > On 9/19/19 2:33 AM, Martin Liška wrote: > > > Hi. > > > > > > Function reordering has been around for quite some time and a naive > > > implementation was also part of my diploma thesis some time ago. > > > Currently, the GCC can reorder function based on first execution, which > > > happens with PGO and LTO of course. Known limitation is that the order > > > is preserved only partially as various symbols go into different LTRANS > > > partitions. > > > > > > There has been some research in the area and I would point out the > > > Facebook paper > > > ([1]) and Sony presentation ([2]). Based on that, I decided to make a new > > > implementation > > > in the GCC that does the same (in a proper way). First part of the > > > enablement are patches > > > to ld.bfd and ld.gold that come up with a new section .text.sorted, that > > > is always sorted. > > > There's a snippet from the modified default linker script: > > Funny, I was doing this via linker scripts circa ~95, in fact that's why > > we have -ffunction-sections :-) We started with scripts which > > post-processed profiling data to create linker scripts for ELF systems. > > We had something for HPUX/SOM as well, but I can't remember what > > mechanism we used, it may have been a gross level sorting using the SOM > > section sort key mechanism - it only had 128 or 256 keys with a notable > > amount of them reserved. > > > > We had also built a linker with a basic level of interposition circa > > 1993 and explored various approaches to reordering executables. I'd > > just joined the group at the time and was responsible for wiring up > > stuff on the OS side, but eventually got the "pleasure" of owning the > > linker server. A lot of the C3 algorithmic stuff looks similar to what > > we did. > > For reference I am attaching early LTO version from circa 2010 :) > > > > Anyway... > > > > I don't see anything objectionable in here. It's noted as an RFC. Are > > you interested in pushing this forward for gcc-10? > > I think it is plan to get some form of code layout pass into GCC10. I > will test Martin's version on Firefox and see if it have any effect > here. It is generally quite hard to evaluate those changes (and it is > reason why I did not moved forward with version form 2010 - more > precisely it kept falling off the shelf for about a decade) > > If LLD has support for cgraph annotations and we support LLD, i think we > should add support for that, too - it will be very useful to compare > indvidiual implementations. > I believe there is gold support (off tree?) for that too and something > similar is also supported by other toolchains like AIX? > > One problem with the sections approach is that every section gets > aligned to largest code alignment inside of the section. Since our > alignments are (should be) often cache line based we get cca 30 bytes of > waste for every function that is quite a lot. > > This is why we currently have way to order function when outputting them > and use that with FDO (using Martin's first execution logic). This has > drwarback of making the functions to flow in that order through late > optimizations and RTL backend and thus we lose IPA-RA and some > IP propagation (late pure/const/nothrow discovery). But you can also fix that by the parallelization GSoC project approach, decoupling things at RTL expansion IPA-wise and using output order for the latter (or even do the fragments in GCC itself by refactoring the output machinery to use function-local "strings", assembling the final file later). Refactoring output to handle multiple output "files" at the same time might also help getting rid of that early-LTO-debug copying stuff (now that it seems that linker-plugin extensions for partially claiming files will never happen...) > So this approach has a drawback, too. It is why i was trying to push > myself to get gcc to use gas fragments :) > > Anyway, all these issues can be sovled incementally - lets see how > Maritn's patch work on Firefox and if we can get it tested elsewhere and > start from that. > > I will take a look into the details. > > Honza > > > > jeff > > > > Index: tree-pass.h > === > *** tree-pass.h (revision 164689) > --- tree-pass.h (working copy) > *** extern struct ipa_opt_pass_d pass_ipa_lt > *** 467,472 > --- 467,473 > extern struct ipa_opt_pass_d pass_ipa_lto_finish_out; > extern struct ipa_opt_pass_d pass_ipa_profile; > extern struct ipa_opt_pass_d pass_ipa_cdtor_merge; > + extern struct ipa_opt_pass_d pass_ipa_func_reorder; > > extern struct gimple_opt_pass pass_all_optimizations; > extern struct gimple_opt_pass pass_cleanup_cfg_post_optimizing; > Index: cgraphunit.c > === > *** cgraphunit.c(revision 164689) > --- cgraphunit.c(working copy) >
Re: [PATCH][RFC] Add new ipa-reorder pass
> On 9/19/19 2:33 AM, Martin Liška wrote: > > Hi. > > > > Function reordering has been around for quite some time and a naive > > implementation was also part of my diploma thesis some time ago. > > Currently, the GCC can reorder function based on first execution, which > > happens with PGO and LTO of course. Known limitation is that the order > > is preserved only partially as various symbols go into different LTRANS > > partitions. > > > > There has been some research in the area and I would point out the Facebook > > paper > > ([1]) and Sony presentation ([2]). Based on that, I decided to make a new > > implementation > > in the GCC that does the same (in a proper way). First part of the > > enablement are patches > > to ld.bfd and ld.gold that come up with a new section .text.sorted, that is > > always sorted. > > There's a snippet from the modified default linker script: > Funny, I was doing this via linker scripts circa ~95, in fact that's why > we have -ffunction-sections :-) We started with scripts which > post-processed profiling data to create linker scripts for ELF systems. > We had something for HPUX/SOM as well, but I can't remember what > mechanism we used, it may have been a gross level sorting using the SOM > section sort key mechanism - it only had 128 or 256 keys with a notable > amount of them reserved. > > We had also built a linker with a basic level of interposition circa > 1993 and explored various approaches to reordering executables. I'd > just joined the group at the time and was responsible for wiring up > stuff on the OS side, but eventually got the "pleasure" of owning the > linker server. A lot of the C3 algorithmic stuff looks similar to what > we did. For reference I am attaching early LTO version from circa 2010 :) > > Anyway... > > I don't see anything objectionable in here. It's noted as an RFC. Are > you interested in pushing this forward for gcc-10? I think it is plan to get some form of code layout pass into GCC10. I will test Martin's version on Firefox and see if it have any effect here. It is generally quite hard to evaluate those changes (and it is reason why I did not moved forward with version form 2010 - more precisely it kept falling off the shelf for about a decade) If LLD has support for cgraph annotations and we support LLD, i think we should add support for that, too - it will be very useful to compare indvidiual implementations. I believe there is gold support (off tree?) for that too and something similar is also supported by other toolchains like AIX? One problem with the sections approach is that every section gets aligned to largest code alignment inside of the section. Since our alignments are (should be) often cache line based we get cca 30 bytes of waste for every function that is quite a lot. This is why we currently have way to order function when outputting them and use that with FDO (using Martin's first execution logic). This has drwarback of making the functions to flow in that order through late optimizations and RTL backend and thus we lose IPA-RA and some IP propagation (late pure/const/nothrow discovery). So this approach has a drawback, too. It is why i was trying to push myself to get gcc to use gas fragments :) Anyway, all these issues can be sovled incementally - lets see how Maritn's patch work on Firefox and if we can get it tested elsewhere and start from that. I will take a look into the details. Honza > > jeff > Index: tree-pass.h === *** tree-pass.h (revision 164689) --- tree-pass.h (working copy) *** extern struct ipa_opt_pass_d pass_ipa_lt *** 467,472 --- 467,473 extern struct ipa_opt_pass_d pass_ipa_lto_finish_out; extern struct ipa_opt_pass_d pass_ipa_profile; extern struct ipa_opt_pass_d pass_ipa_cdtor_merge; + extern struct ipa_opt_pass_d pass_ipa_func_reorder; extern struct gimple_opt_pass pass_all_optimizations; extern struct gimple_opt_pass pass_cleanup_cfg_post_optimizing; Index: cgraphunit.c === *** cgraphunit.c(revision 164689) --- cgraphunit.c(working copy) *** cgraph_inline_p (struct cgraph_edge *e, *** 1517,1522 --- 1517,1529 return !e->inline_failed; } + static int + node_cmp (const void *pa, const void *pb) + { + const struct cgraph_node *a = *(const struct cgraph_node * const *) pa; + const struct cgraph_node *b = *(const struct cgraph_node * const *) pb; + return b->order - a->order; + } /* Expand all functions that must be output. *** cgraph_expand_all_functions (void) *** 1546,1551 --- 1553,1561 if (order[i]->process) order[new_order_pos++] = order[i]; + if (flag_ipa_reorder) + qsort (order, new_order_pos, sizeof (struct cgraph_node *), node_cmp); + for (i = new_order_pos - 1; i >= 0; i--) { node
Re: [PATCH][RFC] Add new ipa-reorder pass
On 9/19/19 2:33 AM, Martin Liška wrote: > Hi. > > Function reordering has been around for quite some time and a naive > implementation was also part of my diploma thesis some time ago. > Currently, the GCC can reorder function based on first execution, which > happens with PGO and LTO of course. Known limitation is that the order > is preserved only partially as various symbols go into different LTRANS > partitions. > > There has been some research in the area and I would point out the Facebook > paper > ([1]) and Sony presentation ([2]). Based on that, I decided to make a new > implementation > in the GCC that does the same (in a proper way). First part of the enablement > are patches > to ld.bfd and ld.gold that come up with a new section .text.sorted, that is > always sorted. > There's a snippet from the modified default linker script: Funny, I was doing this via linker scripts circa ~95, in fact that's why we have -ffunction-sections :-) We started with scripts which post-processed profiling data to create linker scripts for ELF systems. We had something for HPUX/SOM as well, but I can't remember what mechanism we used, it may have been a gross level sorting using the SOM section sort key mechanism - it only had 128 or 256 keys with a notable amount of them reserved. We had also built a linker with a basic level of interposition circa 1993 and explored various approaches to reordering executables. I'd just joined the group at the time and was responsible for wiring up stuff on the OS side, but eventually got the "pleasure" of owning the linker server. A lot of the C3 algorithmic stuff looks similar to what we did. Anyway... I don't see anything objectionable in here. It's noted as an RFC. Are you interested in pushing this forward for gcc-10? jeff
Re: [PATCH][RFC] Add new ipa-reorder pass
On Thu, Oct 3, 2019 at 12:46 AM Fangrui Song wrote: > > > On 2019-10-03, Andrew Pinski wrote: > >On Wed, Oct 2, 2019@9:52 PM Fangrui Song wrote: > >> > >> On 2019-09-24, Martin Liška wrote: > >> >On 9/19/19 10:33 AM, Martin Liška wrote: > >> >> - One needs modified binutils and I that would probably require a > >> >> configure detection. The only way > >> >> which I see is based on ld --version. I'm planning to make the > >> >> binutils submission soon. > >> > > >> >The patch submission link: > >> >https://sourceware.org/ml/binutils/2019-09/msg00219.html > >> > >> Hi Martin, > >> > >> I have a question about why .text.sorted.* are needed. > >> > >> The Sony presentation (your [2] link) embedded a new section > >> .llvm.call-graph-profile[3] to represent edges in the object files. The > >> linker (lld) collects all .llvm.call-graph-profile sections and does a > >> C3 layout. There is no need for new section type .text.sorted.* > >> > >> [3]: > >> https://github.com/llvm/llvm-project/blob/master/lld/test/ELF/cgprofile-obj.s > >> > >> (Please CC me. I am not subscribed.) > > > >The idea of GCC's version is to that the modification needed to the > >linker is very little. And even without patching the linker > >script/linker, you don't need to much special and you don't need LD to > >do much work@all. > > I am afraid this can be a large limitation. Then .text.sorted.* can > only a) reorder functions within a translation unit, b) or reorder all > functions when LTO is enabled. I don't think it is that limited. In fact I think the other way around is more limited as you are now limited on changing the call graph data format. This reduces the complexity of the linker really. Linkers are already slow; why make them even slower? This patch is you only need to change the linker once and you don't have dependencies between the linker and compiler. Thanks, Andrew > > b) is possible only if all .text.sorted.* sections can be numbered. > > For the LLVM case, call graph profile can be used without LTO. When both > ThinLTO+PGO are enabled, however, the additional performance improvement > offered by the global call graph profile reordering is insignificant, > smaller than 1%.See you said it was less than 1% so why do it then? Thanks, Andrew Pinski
Re: [PATCH][RFC] Add new ipa-reorder pass
On 2019-10-03, Andrew Pinski wrote: On Wed, Oct 2, 2019@9:52 PM Fangrui Song wrote: On 2019-09-24, Martin Liška wrote: >On 9/19/19 10:33 AM, Martin Liška wrote: >> - One needs modified binutils and I that would probably require a configure detection. The only way >> which I see is based on ld --version. I'm planning to make the binutils submission soon. > >The patch submission link: >https://sourceware.org/ml/binutils/2019-09/msg00219.html Hi Martin, I have a question about why .text.sorted.* are needed. The Sony presentation (your [2] link) embedded a new section .llvm.call-graph-profile[3] to represent edges in the object files. The linker (lld) collects all .llvm.call-graph-profile sections and does a C3 layout. There is no need for new section type .text.sorted.* [3]: https://github.com/llvm/llvm-project/blob/master/lld/test/ELF/cgprofile-obj.s (Please CC me. I am not subscribed.) The idea of GCC's version is to that the modification needed to the linker is very little. And even without patching the linker script/linker, you don't need to much special and you don't need LD to do much work@all. I am afraid this can be a large limitation. Then .text.sorted.* can only a) reorder functions within a translation unit, b) or reorder all functions when LTO is enabled. b) is possible only if all .text.sorted.* sections can be numbered. For the LLVM case, call graph profile can be used without LTO. When both ThinLTO+PGO are enabled, however, the additional performance improvement offered by the global call graph profile reordering is insignificant, smaller than 1%.
Re: [PATCH][RFC] Add new ipa-reorder pass
On Wed, Oct 2, 2019 at 9:52 PM Fangrui Song wrote: > > On 2019-09-24, Martin Liška wrote: > >On 9/19/19 10:33 AM, Martin Liška wrote: > >> - One needs modified binutils and I that would probably require a > >> configure detection. The only way > >> which I see is based on ld --version. I'm planning to make the binutils > >> submission soon. > > > >The patch submission link: > >https://sourceware.org/ml/binutils/2019-09/msg00219.html > > Hi Martin, > > I have a question about why .text.sorted.* are needed. > > The Sony presentation (your [2] link) embedded a new section > .llvm.call-graph-profile[3] to represent edges in the object files. The > linker (lld) collects all .llvm.call-graph-profile sections and does a > C3 layout. There is no need for new section type .text.sorted.* > > [3]: > https://github.com/llvm/llvm-project/blob/master/lld/test/ELF/cgprofile-obj.s > > (Please CC me. I am not subscribed.) The idea of GCC's version is to that the modification needed to the linker is very little. And even without patching the linker script/linker, you don't need to much special and you don't need LD to do much work at all. Thanks, Andrew
Re: [PATCH][RFC] Add new ipa-reorder pass
On 2019-09-24, Martin Liška wrote: On 9/19/19 10:33 AM, Martin Liška wrote: - One needs modified binutils and I that would probably require a configure detection. The only way which I see is based on ld --version. I'm planning to make the binutils submission soon. The patch submission link: https://sourceware.org/ml/binutils/2019-09/msg00219.html Hi Martin, I have a question about why .text.sorted.* are needed. The Sony presentation (your [2] link) embedded a new section .llvm.call-graph-profile[3] to represent edges in the object files. The linker (lld) collects all .llvm.call-graph-profile sections and does a C3 layout. There is no need for new section type .text.sorted.* [3]: https://github.com/llvm/llvm-project/blob/master/lld/test/ELF/cgprofile-obj.s (Please CC me. I am not subscribed.)
Re: [PATCH][RFC] Add new ipa-reorder pass
On 9/25/19 6:36 PM, Evgeny Kudryashov wrote: > On 2019-09-19 11:33, Martin Liška wrote: >> Hi. >> >> Function reordering has been around for quite some time and a naive >> implementation was also part of my diploma thesis some time ago. >> Currently, the GCC can reorder function based on first execution, which >> happens with PGO and LTO of course. Known limitation is that the order >> is preserved only partially as various symbols go into different >> LTRANS partitions. >> >> There has been some research in the area and I would point out the >> Facebook paper >> ([1]) and Sony presentation ([2]). Based on that, I decided to make a >> new implementation >> in the GCC that does the same (in a proper way). First part of the >> enablement are patches >> to ld.bfd and ld.gold that come up with a new section .text.sorted, >> that is always sorted. >> >> Thoughts? I would definitely welcome any interesting measurement on a >> bigger load. >> >> Martin >> > > Hi, Martin! > > Some time ago I tried to do the same but didn't go that far. Hello. > > I also used the C3 algorithm, except for the fact that ipa_fn_summary > contains information about size and time (somehow missed it). Which is a key part of the algorithm as one wants not to cross a page size boundary (in ideal case). > The linker option --sort-section=name was used for prototyping. It, > obviously, sorts sections and allows to place functions in the desired order > (by placing them into named sections .text.sorted., without patching > linkers or adjusting linker scripts). Yes, that can work, but having the new section explicitly sorted will not require a passing of an option to linker. > For testing my implementation I used several benchmarks from SPEC2006: > perlbench, sjeng and gobmk. Unfortunately, no significant positive changes > were obtained. > > I've tested the proposed pass on perlbench with train and reference input > (PGO+LTO as a base) but couldn't obtain stable results (still unclear > environment or perlbench specific reasons). Yes, it's quite difficult to measure something. One needs a huge .text section of a binary and a test-case which has a flat profile. I was able to measure results on the gcc itself. Martin > > Evgeny. >
Re: [PATCH][RFC] Add new ipa-reorder pass
On 2019-09-19 11:33, Martin Liška wrote: Hi. Function reordering has been around for quite some time and a naive implementation was also part of my diploma thesis some time ago. Currently, the GCC can reorder function based on first execution, which happens with PGO and LTO of course. Known limitation is that the order is preserved only partially as various symbols go into different LTRANS partitions. There has been some research in the area and I would point out the Facebook paper ([1]) and Sony presentation ([2]). Based on that, I decided to make a new implementation in the GCC that does the same (in a proper way). First part of the enablement are patches to ld.bfd and ld.gold that come up with a new section .text.sorted, that is always sorted. Thoughts? I would definitely welcome any interesting measurement on a bigger load. Martin Hi, Martin! Some time ago I tried to do the same but didn't go that far. I also used the C3 algorithm, except for the fact that ipa_fn_summary contains information about size and time (somehow missed it). The linker option --sort-section=name was used for prototyping. It, obviously, sorts sections and allows to place functions in the desired order (by placing them into named sections .text.sorted., without patching linkers or adjusting linker scripts). For testing my implementation I used several benchmarks from SPEC2006: perlbench, sjeng and gobmk. Unfortunately, no significant positive changes were obtained. I've tested the proposed pass on perlbench with train and reference input (PGO+LTO as a base) but couldn't obtain stable results (still unclear environment or perlbench specific reasons). Evgeny.
Re: [PATCH][RFC] Add new ipa-reorder pass
On 9/19/19 10:33 AM, Martin Liška wrote: > - One needs modified binutils and I that would probably require a configure > detection. The only way > which I see is based on ld --version. I'm planning to make the binutils > submission soon. The patch submission link: https://sourceware.org/ml/binutils/2019-09/msg00219.html