Re: [PATCH][RFC] Add new ipa-reorder pass

2019-12-11 Thread Andi Kleen
Martin Liška  writes:
>
> Notes and limitations:
> - The call-chain-clustering algorithm requires to fit as many as possible 
> functions into page size (4K).
>   Based on my measurements that should correspond to ~1000 GIMPLE statements 
> (IPA inliner size). I can
>   make it a param in the future.

That sounds inaccurate. I would assume e.g. integer code has a
quite different factor than floating point code. Perhaps it would
need something fancier.

If you use a static factor you probably would need to calibrate it
over a lot more code.

Anyways, I think it should be a param, or better even an option, because there's
a trend towards using 2MB code pages on x86. Linux has a number of ways to do
that today.

Also of course that's needed for other architectures.

-Andi


Re: [PATCH][RFC] Add new ipa-reorder pass

2019-12-10 Thread Jan Hubicka
>   * Makefile.in: Add ipa-reorder.o.
>   * cgraph.c (cgraph_node::dump): Dump
>   text_sorted_order.
>   (cgraph_node_cmp_by_text_sorted):
>   New function that sorts functions based
>   on text_sorted_order.
>   * cgraph.h (cgraph_node): Add text_sorted_order.
>   (cgraph_node_cmp_by_text_sorted): New.
>   * cgraphclones.c (cgraph_node::create_clone):
>   Clone also text_sorted_order.
>   * cgraphunit.c (node_cmp): Remove.
>   (expand_all_functions): Use new function
>   cgraph_node_cmp_by_text_sorted.
>   * common.opt: Add new option reorder_functions_algorithm.
>   * flag-types.h (enum reorder_functions_algorithm):
>   New enum.
>   * ipa-reorder.c: New file.
>   * lto-cgraph.c (lto_output_node): Stream in and out
>   text_sorted_order.
>   (input_node): Likewise.
>   * passes.def: Add pass_ipa_reorder.
>   * timevar.def (TV_IPA_REORDER): New.
>   * tree-pass.h (make_pass_ipa_reorder): New.
>   * varasm.c (default_function_section): Assign text.sorted.X
>   section.
> 
> gcc/lto/ChangeLog:
> 
> 2019-11-25  Martin Liska  
> 
>   * lto-partition.c (node_cmp): Remove.
>   (lto_balanced_map): Use new cgraph_node_cmp_by_text_sorted.
> +/* Sort cgraph_nodes by text_sorted_order if available, or by order.  */
> +
> +int
> +cgraph_node_cmp_by_text_sorted (const void *pa, const void *pb)
> +{
> +  const cgraph_node *a = *(const cgraph_node * const *) pa;
> +  const cgraph_node *b = *(const cgraph_node * const *) pb;
> +
> +  /* Functions with text_sorted_order should be before these
> + without profile.  */
> +  if (a->text_sorted_order == 0 || b->text_sorted_order == 0)
> +return a->text_sorted_order - b->text_sorted_order;
> +
> +  return a->text_sorted_order != b->text_sorted_order
> +  ? b->text_sorted_order - a->text_sorted_order
> +  : b->order - a->order;
> +}

We now have tp_first_run_node_cmp with needs to be updated and renamed
isntead of adding new comparator.
>/* Time profiler: first run of function.  */
>int tp_first_run;
> +  /* Order in .text.sorted.* section.  */
> +  int text_sorted_order;

tp_first_run probalby should go into summary since it will live from
profling to the ipa-reorder pass where it will be copied to
text_sorted_order? Or how these two interface?
> diff --git a/gcc/cgraphclones.c b/gcc/cgraphclones.c
> index a79491e0b88..f624cbee185 100644
> --- a/gcc/cgraphclones.c
> +++ b/gcc/cgraphclones.c
> @@ -372,6 +372,7 @@ cgraph_node::create_clone (tree new_decl, profile_count 
> prof_count,
>new_node->rtl = rtl;
>new_node->frequency = frequency;
>new_node->tp_first_run = tp_first_run;
> +  new_node->text_sorted_order = text_sorted_order;
Copy it also in duplicate_thunk_for_node and create_version_node.
I think we should refactor the copying code and share it in all places
we duplicate function.
> +freorder-functions-algorithm=
> +Common Joined RejectNegative Enum(reorder_functions_algorithm) 
> Var(flag_reorder_functions_algorithm) 
> Init(REORDER_FUNCTIONS_ALGORITHM_CALL_CHAIN_CLUSTERING) Optimization
> +-freorder-functions-algorithm=[first-run|call-chain-clustering]  Set the 
> used function reordering algorithm.
> +
> +Enum
> +Name(reorder_functions_algorithm) Type(enum reorder_functions_algorithm) 
> UnknownError(unknown function reordering algorithm %qs)
> +
> +EnumValue
> +Enum(reorder_functions_algorithm) String(first-run) 
> Value(REORDER_FUNCTIONS_ALGORITHM_FIRST_RUN)
> +
> +EnumValue
> +Enum(reorder_functions_algorithm) String(call-chain-clustering) 
> Value(REORDER_FUNCTIONS_ALGORITHM_CALL_CHAIN_CLUSTERING)

I think we should also have "definition-order" which will sort by
node->order.  For testing we may also want to have "random" which will
do something stupid like sorting by symbol name checksum so we could see
how these compares.
> +/* The algorithm used for function reordering.  */
> +enum reorder_functions_algorithm
> +{
> +  REORDER_FUNCTIONS_ALGORITHM_FIRST_RUN,
> +  REORDER_FUNCTIONS_ALGORITHM_CALL_CHAIN_CLUSTERING
> +};

Isn't this supposed to be generated by awk scripts?
> +
> +#define C3_CLUSTER_THRESHOLD 1024
description + perhaps PARAM?
> +
> +struct cluster_edge;
> +
> +/* Cluster is set of functions considered in C3 algorithm.  */
Perhaps an high level overview of C3 algorithm with a reference would be
nice.
> +/* Sort functions based of first execution of the function.  */
> +
> +static void
> +sort_functions_by_first_run (void)
> +{
> +  cgraph_node *node;
> +
> +  FOR_EACH_DEFINED_FUNCTION (node)
> +if (node->tp_first_run && !node->alias)
> +  node->text_sorted_order = node->tp_first_run;

Why you care about excluding aliases?  We also have thunks where
tp_first_run is ignored.  I am just curious if you got some ICE here.
> +}
> +
> +/* Compare clusters by density after that are established.  */
> +
> +static int
> +cluster_cmp (const void *a_p, const void *b_p)
> +{
> +  const cluster *a = *(cluster 

Re: [PATCH][RFC] Add new ipa-reorder pass

2019-12-09 Thread Martin Liška

Hi.

I've just updated the script a bit and I added also address histogram:
https://drive.google.com/file/d/11s9R_JnEMohDE6ctqzsj092QD22HKXJI/view?usp=sharing

Martin


Re: [PATCH][RFC] Add new ipa-reorder pass

2019-12-09 Thread Jan Hubicka
> > On the first glance the difference between gcc9 and gcc10 is explained
> > by the changes to profile updating. gcc9 makes very small cold
> > partitions compared to gcc10.  It is very nice that we have a way to
> > measure it. I will also check if some of the more important profiling
> > update fixes makes sense to backport to gcc9.
> > 
> > Over weekend I did some fixes to tp reordreing, so it may be nice to
> > update your tests, but I will try to run it myself.
> > 
> > In general one can see individual stages of compilation on the graph -
> > parsing, early lowering, early opts.  On bigger programs this should be
> > more visible.  I will give it a try.
> 
> You haven't replied to question whether we want to let ipa-reorder into
> trunk based on the sent images for GCC 10 PGO+LTO boostrap?

My concern is still the same - while I like the patch I am worried that
we have only one example where it produces some benefit. For this reason
I looked into tp_first_run issues this weekend and found fixed some
issues / verified that the order now seems to be fine for cc1 binary and
partly for Firefox.

We do run periodic benchmarks to keep optimization in shape, but code layout
stuff is out in wild and no one is verifying that it works. It is now
bit nontrivial piece of code and thus it is not big suprise that there
are numeber of bugs.

I like the heatmap generator (but for some reason it generates empty
pngs for me. Maybe things are just out of range since bounds are
hardcoded in your script) 

I don't have much time today but tomorrow I will try to get to it
tested and send some comments on the patch.

Honza


Re: [PATCH][RFC] Add new ipa-reorder pass

2019-12-09 Thread Martin Liška

On 12/9/19 2:03 PM, Jan Hubicka wrote:

On 12/9/19 1:14 PM, Martin Liška wrote:

Hello.

Based on presentation that had Sriraman Tallam at a LLVM conference:
https://www.youtube.com/watch?v=DySuXFGmB40

I made a heatmap based on executed instruction addresses. I used
$ perf record -F max -- ./cc1plus -fpreprocessed 
/home/marxin/Programming/tramp3d/tramp3d-v4.ii
and
$ perf script -F time,ip,dso

I'm sending link for my system GCC 9 (PGO+lean LTO bootstrap), GCC 10 before 
and after my reorder
patch (also PGO+lean LTO bootstrap).

One can see quite significant clustering starting from 5s till the end of 
compilation.
Link: https://drive.google.com/open?id=1M0YlxvQPyiVguy5VWRC8dG52UArwAuKS

Martin


For the completeness, the heatmap was generated with the following script:
https://github.com/marxin/script-misc/blob/master/binary-heatmap.py


Thanks,
this looks really useful as we had almost no way to check code layout
ever since you systemtap script stopped working.


Great, thanks.



On the first glance the difference between gcc9 and gcc10 is explained
by the changes to profile updating. gcc9 makes very small cold
partitions compared to gcc10.  It is very nice that we have a way to
measure it. I will also check if some of the more important profiling
update fixes makes sense to backport to gcc9.

Over weekend I did some fixes to tp reordreing, so it may be nice to
update your tests, but I will try to run it myself.

In general one can see individual stages of compilation on the graph -
parsing, early lowering, early opts.  On bigger programs this should be
more visible.  I will give it a try.


You haven't replied to question whether we want to let ipa-reorder into
trunk based on the sent images for GCC 10 PGO+LTO boostrap?

Martin



Honza





Re: [PATCH][RFC] Add new ipa-reorder pass

2019-12-09 Thread Jan Hubicka
> On 12/9/19 1:14 PM, Martin Liška wrote:
> > Hello.
> > 
> > Based on presentation that had Sriraman Tallam at a LLVM conference:
> > https://www.youtube.com/watch?v=DySuXFGmB40
> > 
> > I made a heatmap based on executed instruction addresses. I used
> > $ perf record -F max -- ./cc1plus -fpreprocessed 
> > /home/marxin/Programming/tramp3d/tramp3d-v4.ii
> > and
> > $ perf script -F time,ip,dso
> > 
> > I'm sending link for my system GCC 9 (PGO+lean LTO bootstrap), GCC 10 
> > before and after my reorder
> > patch (also PGO+lean LTO bootstrap).
> > 
> > One can see quite significant clustering starting from 5s till the end of 
> > compilation.
> > Link: https://drive.google.com/open?id=1M0YlxvQPyiVguy5VWRC8dG52UArwAuKS
> > 
> > Martin
> 
> For the completeness, the heatmap was generated with the following script:
> https://github.com/marxin/script-misc/blob/master/binary-heatmap.py

Thanks,
this looks really useful as we had almost no way to check code layout
ever since you systemtap script stopped working.

On the first glance the difference between gcc9 and gcc10 is explained
by the changes to profile updating. gcc9 makes very small cold
partitions compared to gcc10.  It is very nice that we have a way to
measure it. I will also check if some of the more important profiling
update fixes makes sense to backport to gcc9.

Over weekend I did some fixes to tp reordreing, so it may be nice to
update your tests, but I will try to run it myself.

In general one can see individual stages of compilation on the graph -
parsing, early lowering, early opts.  On bigger programs this should be
more visible.  I will give it a try.

Honza


Re: [PATCH][RFC] Add new ipa-reorder pass

2019-12-09 Thread Martin Liška

On 12/9/19 1:14 PM, Martin Liška wrote:

Hello.

Based on presentation that had Sriraman Tallam at a LLVM conference:
https://www.youtube.com/watch?v=DySuXFGmB40

I made a heatmap based on executed instruction addresses. I used
$ perf record -F max -- ./cc1plus -fpreprocessed 
/home/marxin/Programming/tramp3d/tramp3d-v4.ii
and
$ perf script -F time,ip,dso

I'm sending link for my system GCC 9 (PGO+lean LTO bootstrap), GCC 10 before 
and after my reorder
patch (also PGO+lean LTO bootstrap).

One can see quite significant clustering starting from 5s till the end of 
compilation.
Link: https://drive.google.com/open?id=1M0YlxvQPyiVguy5VWRC8dG52UArwAuKS

Martin


For the completeness, the heatmap was generated with the following script:
https://github.com/marxin/script-misc/blob/master/binary-heatmap.py

Martin


Re: [PATCH][RFC] Add new ipa-reorder pass

2019-12-09 Thread Martin Liška

Hello.

Based on presentation that had Sriraman Tallam at a LLVM conference:
https://www.youtube.com/watch?v=DySuXFGmB40

I made a heatmap based on executed instruction addresses. I used
$ perf record -F max -- ./cc1plus -fpreprocessed 
/home/marxin/Programming/tramp3d/tramp3d-v4.ii
and
$ perf script -F time,ip,dso

I'm sending link for my system GCC 9 (PGO+lean LTO bootstrap), GCC 10 before 
and after my reorder
patch (also PGO+lean LTO bootstrap).

One can see quite significant clustering starting from 5s till the end of 
compilation.
Link: https://drive.google.com/open?id=1M0YlxvQPyiVguy5VWRC8dG52UArwAuKS

Martin


Re: [PATCH][RFC] Add new ipa-reorder pass

2019-12-02 Thread Martin Liška

On 12/1/19 11:37 PM, Jan Hubicka wrote:

Hi,
I was playing with it a bit more and built with
-fno-profile-reorder-functions.

Here is -fno-profile-reorder-functions compared to first run
https://treeherder.mozilla.org/perf.html#/compare?originalProject=try=3d537be0cb37458e7928f69a37efb2a6d6b85eae=try=4543abfa08870391544b56d16dfcd530dac0dc30=1
2.2% improvement on the page rendering is off noise but I would hope for
bit more.

Here is -fno-profile-reorder-functions compared to clustering
https://treeherder.mozilla.org/perf.html#/compare?originalProject=try=3d537be0cb37458e7928f69a37efb2a6d6b85eae=try=1c2d53b10b042c15edbe7bd26e2740641840=1
I am not sure if it is a noise.

Either Fireox's talos is not good benchmark for code layout (which I
would be surprised since it is quite sensitive to code size issues) or
there are some problems.

In general I think the patch is useful and mostly mainline ready except
for detailes but it would be good to have some more evidence that it
works as expected on large binaries besides tramp3d build time.  There
are number of ways where things may go wrong ranging from misupdated
profiles, problems with function splitting, comdats and other issues.


Based on my testing, I was able to see cc1plus binary really sorted as seen
by the pass, including various IPA clones that inherited order from their
origins.



Was you able to benchmark some other benefits?


Unfortunately not.


I remember we discussed
collecting traces from valgrind, perhaps we could test that they are
looking good?


I have some semi-working icegrind port here:
https://github.com/marxin/valgrind/tree/icegrind

It will probably need some extra work and it's terribly slow. But you can try
it.

I would first wait for the Firefox dump files and then we can discuss what
to do with the pass.

Martin



Honza





Re: [PATCH][RFC] Add new ipa-reorder pass

2019-12-02 Thread Martin Liška

On 12/1/19 11:45 AM, Jan Hubicka wrote:

Hi.

I'm sending v3 of the patch where I changed:
- function.cold sections are properly put into .text.unlikely and
   not into a .text.sorted.XYZ section

I've just finished measurements and I still have the original speed up
for tramp3d:
Total runs: 10, before: 13.92, after: 13.82, cmp: 99.219%


Hi,
I have updated binutils to current head on the Firefox testing patch and
run FDO build with tp first run ordering and call chain clustering.
https://treeherder.mozilla.org/perf.html#/compare?originalProject=try=1313e6a4d74ebff702afa7594684beb83c01d95f=try=1c2d53b10b042c15edbe7bd26e2740641840=1


Hello.

Thank you for the testing.



It seems there are no differences in performance. The two binaries can
be downloaded at

w/o patch:
https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/MK-7DC3FQcevZC_Nvlnq8Q/runs/0/artifacts/public/build/target.tar.bz2
with call chain clustering.
https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/UVh6iNILT-qb8sYM5vxVCQ/runs/0/artifacts/public/build/target.tar.bz2


I would ideally need output of -fdump-ipa-reorder which prints sorted symbols 
and
so that I can compare it with resulting assembly.



Since Firefox is quite sensitive to code size I would expect to be able
to measure some benefits here.  Any idea what may have go wrong?


That's a pity.


I checked that the binaries seems generally sane - out of 58MB text
segment there is 34MB cold section. It is possible that system ld is
used instead of provided one, but that would be weird.  I will try to
find way to double-check that updating binutils really updated them for
GCC.


I can double check once having the dump file.

Martin



Honza





Re: [PATCH][RFC] Add new ipa-reorder pass

2019-12-02 Thread Martin Liška

On 12/1/19 12:11 PM, Jan Hubicka wrote:

Hi.

I'm sending v3 of the patch where I changed:
- function.cold sections are properly put into .text.unlikely and
   not into a .text.sorted.XYZ section

I've just finished measurements and I still have the original speed up
for tramp3d:
Total runs: 10, before: 13.92, after: 13.82, cmp: 99.219%


Hi,
I have updated binutils to current head on the Firefox testing patch and
run FDO build with tp first run ordering and call chain clustering.
https://treeherder.mozilla.org/perf.html#/compare?originalProject=try=1313e6a4d74ebff702afa7594684beb83c01d95f=try=1c2d53b10b042c15edbe7bd26e2740641840=1

It seems there are no differences in performance. The two binaries can
be downloaded at

w/o patch:
https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/MK-7DC3FQcevZC_Nvlnq8Q/runs/0/artifacts/public/build/target.tar.bz2
with call chain clustering.
https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/UVh6iNILT-qb8sYM5vxVCQ/runs/0/artifacts/public/build/target.tar.bz2

Since Firefox is quite sensitive to code size I would expect to be able
to measure some benefits here.  Any idea what may have go wrong?
I checked that the binaries seems generally sane - out of 58MB text
segment there is 34MB cold section. It is possible that system ld is
used instead of provided one, but that would be weird.  I will try to
find way to double-check that updating binutils really updated them for
GCC.

Linker used is
GNU ld (GNU Binutils) 2.33.50.20191130


Hello.

You can check linker support with:
$ ~/bin/binutils/bin/ld.bfd --verbose | grep text.sorted
*(SORT(.text.sorted.*))

Is Firefox using BFD or GOLD for linking?



Does it have all necessary support in?  Do I need something like
-ffunction-sections?


No, you don't need it.

Martin



Honza





Re: [PATCH][RFC] Add new ipa-reorder pass

2019-12-01 Thread Jan Hubicka
Hi,
I was playing with it a bit more and built with
-fno-profile-reorder-functions.  

Here is -fno-profile-reorder-functions compared to first run 
https://treeherder.mozilla.org/perf.html#/compare?originalProject=try=3d537be0cb37458e7928f69a37efb2a6d6b85eae=try=4543abfa08870391544b56d16dfcd530dac0dc30=1
2.2% improvement on the page rendering is off noise but I would hope for
bit more.

Here is -fno-profile-reorder-functions compared to clustering
https://treeherder.mozilla.org/perf.html#/compare?originalProject=try=3d537be0cb37458e7928f69a37efb2a6d6b85eae=try=1c2d53b10b042c15edbe7bd26e2740641840=1
I am not sure if it is a noise.

Either Fireox's talos is not good benchmark for code layout (which I
would be surprised since it is quite sensitive to code size issues) or
there are some problems.

In general I think the patch is useful and mostly mainline ready except
for detailes but it would be good to have some more evidence that it
works as expected on large binaries besides tramp3d build time.  There
are number of ways where things may go wrong ranging from misupdated
profiles, problems with function splitting, comdats and other issues.

Was you able to benchmark some other benefits?  I remember we discussed
collecting traces from valgrind, perhaps we could test that they are
looking good?

Honza


Re: [PATCH][RFC] Add new ipa-reorder pass

2019-12-01 Thread Jan Hubicka
> > Hi.
> > 
> > I'm sending v3 of the patch where I changed:
> > - function.cold sections are properly put into .text.unlikely and
> >   not into a .text.sorted.XYZ section
> > 
> > I've just finished measurements and I still have the original speed up
> > for tramp3d:
> > Total runs: 10, before: 13.92, after: 13.82, cmp: 99.219%
> 
> Hi,
> I have updated binutils to current head on the Firefox testing patch and
> run FDO build with tp first run ordering and call chain clustering.
> https://treeherder.mozilla.org/perf.html#/compare?originalProject=try=1313e6a4d74ebff702afa7594684beb83c01d95f=try=1c2d53b10b042c15edbe7bd26e2740641840=1
> 
> It seems there are no differences in performance. The two binaries can
> be downloaded at
> 
> w/o patch:
> https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/MK-7DC3FQcevZC_Nvlnq8Q/runs/0/artifacts/public/build/target.tar.bz2
> with call chain clustering.
> https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/UVh6iNILT-qb8sYM5vxVCQ/runs/0/artifacts/public/build/target.tar.bz2
> 
> Since Firefox is quite sensitive to code size I would expect to be able
> to measure some benefits here.  Any idea what may have go wrong?
> I checked that the binaries seems generally sane - out of 58MB text
> segment there is 34MB cold section. It is possible that system ld is
> used instead of provided one, but that would be weird.  I will try to
> find way to double-check that updating binutils really updated them for
> GCC.
Linker used is
GNU ld (GNU Binutils) 2.33.50.20191130

Does it have all necessary support in?  Do I need something like
-ffunction-sections?

Honza


Re: [PATCH][RFC] Add new ipa-reorder pass

2019-12-01 Thread Jan Hubicka
> Hi.
> 
> I'm sending v3 of the patch where I changed:
> - function.cold sections are properly put into .text.unlikely and
>   not into a .text.sorted.XYZ section
> 
> I've just finished measurements and I still have the original speed up
> for tramp3d:
> Total runs: 10, before: 13.92, after: 13.82, cmp: 99.219%

Hi,
I have updated binutils to current head on the Firefox testing patch and
run FDO build with tp first run ordering and call chain clustering.
https://treeherder.mozilla.org/perf.html#/compare?originalProject=try=1313e6a4d74ebff702afa7594684beb83c01d95f=try=1c2d53b10b042c15edbe7bd26e2740641840=1

It seems there are no differences in performance. The two binaries can
be downloaded at

w/o patch:
https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/MK-7DC3FQcevZC_Nvlnq8Q/runs/0/artifacts/public/build/target.tar.bz2
with call chain clustering.
https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/UVh6iNILT-qb8sYM5vxVCQ/runs/0/artifacts/public/build/target.tar.bz2

Since Firefox is quite sensitive to code size I would expect to be able
to measure some benefits here.  Any idea what may have go wrong?
I checked that the binaries seems generally sane - out of 58MB text
segment there is 34MB cold section. It is possible that system ld is
used instead of provided one, but that would be weird.  I will try to
find way to double-check that updating binutils really updated them for
GCC.

Honza


Re: [PATCH][RFC] Add new ipa-reorder pass

2019-11-29 Thread Martin Liška

Hi.

I'm sending v3 of the patch where I changed:
- function.cold sections are properly put into .text.unlikely and
  not into a .text.sorted.XYZ section

I've just finished measurements and I still have the original speed up
for tramp3d:
Total runs: 10, before: 13.92, after: 13.82, cmp: 99.219%

Thoughts?
Martin

>From f3fd85c29a4a2746555294bd30e3c31129030074 Mon Sep 17 00:00:00 2001
From: Martin Liska 
Date: Thu, 5 Sep 2019 13:32:41 +0200
Subject: [PATCH] Add new ipa-reorder pass.

gcc/ChangeLog:

2019-11-25  Martin Liska  

	* Makefile.in: Add ipa-reorder.o.
	* cgraph.c (cgraph_node::dump): Dump
	text_sorted_order.
	(cgraph_node_cmp_by_text_sorted):
	New function that sorts functions based
	on text_sorted_order.
	* cgraph.h (cgraph_node): Add text_sorted_order.
	(cgraph_node_cmp_by_text_sorted): New.
	* cgraphclones.c (cgraph_node::create_clone):
	Clone also text_sorted_order.
	* cgraphunit.c (node_cmp): Remove.
	(expand_all_functions): Use new function
	cgraph_node_cmp_by_text_sorted.
	* common.opt: Add new option reorder_functions_algorithm.
	* flag-types.h (enum reorder_functions_algorithm):
	New enum.
	* ipa-reorder.c: New file.
	* lto-cgraph.c (lto_output_node): Stream in and out
	text_sorted_order.
	(input_node): Likewise.
	* passes.def: Add pass_ipa_reorder.
	* timevar.def (TV_IPA_REORDER): New.
	* tree-pass.h (make_pass_ipa_reorder): New.
	* varasm.c (default_function_section): Assign text.sorted.X
	section.

gcc/lto/ChangeLog:

2019-11-25  Martin Liska  

	* lto-partition.c (node_cmp): Remove.
	(lto_balanced_map): Use new cgraph_node_cmp_by_text_sorted.
---
 gcc/Makefile.in |   1 +
 gcc/cgraph.c|  20 +++
 gcc/cgraph.h|   3 +
 gcc/cgraphclones.c  |   1 +
 gcc/cgraphunit.c|  21 +--
 gcc/common.opt  |  14 ++
 gcc/flag-types.h|   8 +
 gcc/ipa-reorder.c   | 383 
 gcc/lto-cgraph.c|   2 +
 gcc/lto/lto-partition.c |  33 +---
 gcc/passes.def  |   1 +
 gcc/timevar.def |   1 +
 gcc/tree-pass.h |   1 +
 gcc/varasm.c|   9 +
 14 files changed, 448 insertions(+), 50 deletions(-)
 create mode 100644 gcc/ipa-reorder.c

diff --git a/gcc/Makefile.in b/gcc/Makefile.in
index 7d3c13230e4..163d47c47f1 100644
--- a/gcc/Makefile.in
+++ b/gcc/Makefile.in
@@ -1369,6 +1369,7 @@ OBJS = \
 	init-regs.o \
 	internal-fn.o \
 	ipa-cp.o \
+	ipa-reorder.o \
 	ipa-sra.o \
 	ipa-devirt.o \
 	ipa-fnsummary.o \
diff --git a/gcc/cgraph.c b/gcc/cgraph.c
index 180d21e4796..9c09d63e4ee 100644
--- a/gcc/cgraph.c
+++ b/gcc/cgraph.c
@@ -1951,6 +1951,8 @@ cgraph_node::dump (FILE *f)
 }
   if (tp_first_run > 0)
 fprintf (f, " first_run:%i", tp_first_run);
+  if (text_sorted_order > 0)
+fprintf (f, " text_sorted_order:%i", text_sorted_order);
   if (origin)
 fprintf (f, " nested in:%s", origin->asm_name ());
   if (gimple_has_body_p (decl))
@@ -3718,6 +3720,24 @@ cgraph_edge::possibly_call_in_translation_unit_p (void)
   return node->get_availability () >= AVAIL_INTERPOSABLE;
 }
 
+/* Sort cgraph_nodes by text_sorted_order if available, or by order.  */
+
+int
+cgraph_node_cmp_by_text_sorted (const void *pa, const void *pb)
+{
+  const cgraph_node *a = *(const cgraph_node * const *) pa;
+  const cgraph_node *b = *(const cgraph_node * const *) pb;
+
+  /* Functions with text_sorted_order should be before these
+ without profile.  */
+  if (a->text_sorted_order == 0 || b->text_sorted_order == 0)
+return a->text_sorted_order - b->text_sorted_order;
+
+  return a->text_sorted_order != b->text_sorted_order
+	 ? b->text_sorted_order - a->text_sorted_order
+	 : b->order - a->order;
+}
+
 /* A stashed copy of "symtab" for use by selftest::symbol_table_test.
This needs to be a global so that it can be a GC root, and thus
prevent the stashed copy from being garbage-collected if the GC runs
diff --git a/gcc/cgraph.h b/gcc/cgraph.h
index 0d2442c997c..039819db358 100644
--- a/gcc/cgraph.h
+++ b/gcc/cgraph.h
@@ -1435,6 +1435,8 @@ struct GTY((tag ("SYMTAB_FUNCTION"))) cgraph_node : public symtab_node
   unsigned int profile_id;
   /* Time profiler: first run of function.  */
   int tp_first_run;
+  /* Order in .text.sorted.* section.  */
+  int text_sorted_order;
 
   /* Set when decl is an abstract function pointed to by the
  ABSTRACT_DECL_ORIGIN of a reachable function.  */
@@ -2451,6 +2453,7 @@ bool cgraph_function_possibly_inlined_p (tree);
 
 const char* cgraph_inline_failed_string (cgraph_inline_failed_t);
 cgraph_inline_failed_type_t cgraph_inline_failed_type (cgraph_inline_failed_t);
+int cgraph_node_cmp_by_text_sorted (const void *pa, const void *pb);
 
 /* In cgraphunit.c  */
 void cgraphunit_c_finalize (void);
diff --git a/gcc/cgraphclones.c b/gcc/cgraphclones.c
index a79491e0b88..f624cbee185 100644
--- a/gcc/cgraphclones.c
+++ b/gcc/cgraphclones.c
@@ -372,6 +372,7 @@ cgraph_node::create_clone (tree new_decl, profile_count prof_count,
   new_node->rtl = rtl;
   

Re: [PATCH][RFC] Add new ipa-reorder pass

2019-11-25 Thread Martin Liška

Hello.

I'm sending v2 of the patch set based on the discussion I had with Honza.
Changes from previous version:
- I changed type of edge count from uint32_t to uint64_t.
- The algorithm traverses recursively inline clones.
- TDF_DUMP_DETAILS is supported and provides more information.
- I added cgraph_node_cmp_by_text_sorted function that is used
  for symbol sorting both in cgraphunit.c and lto-cgraph.c.

I made the same testing as before I get following numbers of tramp3d w/ -O2:
Total runs: 15, before: 13.89, after: 13.81, cmp: 99.383%

and iTLB-load-misses:u goes from 15,450,298 to 11,633,215.

The corresponding binutils patch part was partially accepted (for BFD) and
today I was given a feedback for ld.gold.

Thoughts?
Thanks,
Martin
>From 9d62da95dc8efd4194565fca6471b3ddb6109c9b Mon Sep 17 00:00:00 2001
From: Martin Liska 
Date: Thu, 5 Sep 2019 13:32:41 +0200
Subject: [PATCH] Add new ipa-reorder pass.

gcc/ChangeLog:

2019-11-25  Martin Liska  

	* Makefile.in: Add ipa-reorder.o.
	* cgraph.c (cgraph_node::dump): Dump
	text_sorted_order.
	(cgraph_node_cmp_by_text_sorted):
	New function that sorts functions based
	on text_sorted_order.
	* cgraph.h (cgraph_node): Add text_sorted_order.
	(cgraph_node_cmp_by_text_sorted): New.
	* cgraphclones.c (cgraph_node::create_clone):
	Clone also text_sorted_order.
	* cgraphunit.c (node_cmp): Remove.
	(expand_all_functions): Use new function
	cgraph_node_cmp_by_text_sorted.
	* common.opt: Add new option reorder_functions_algorithm.
	* flag-types.h (enum reorder_functions_algorithm):
	New enum.
	* ipa-reorder.c: New file.
	* lto-cgraph.c (lto_output_node): Stream in and out
	text_sorted_order.
	(input_node): Likewise.
	* passes.def: Add pass_ipa_reorder.
	* timevar.def (TV_IPA_REORDER): New.
	* tree-pass.h (make_pass_ipa_reorder): New.
	* varasm.c (default_function_section): Assign text.sorted.X
	section.

gcc/lto/ChangeLog:

2019-11-25  Martin Liska  

	* lto-partition.c (node_cmp): Remove.
	(lto_balanced_map): Use new cgraph_node_cmp_by_text_sorted.
---
 gcc/Makefile.in |   1 +
 gcc/cgraph.c|  20 +++
 gcc/cgraph.h|   3 +
 gcc/cgraphclones.c  |   1 +
 gcc/cgraphunit.c|  21 +--
 gcc/common.opt  |  14 ++
 gcc/flag-types.h|   8 +
 gcc/ipa-reorder.c   | 383 
 gcc/lto-cgraph.c|   2 +
 gcc/lto/lto-partition.c |  33 +---
 gcc/passes.def  |   1 +
 gcc/timevar.def |   1 +
 gcc/tree-pass.h |   1 +
 gcc/varasm.c|   9 +
 14 files changed, 448 insertions(+), 50 deletions(-)
 create mode 100644 gcc/ipa-reorder.c

diff --git a/gcc/Makefile.in b/gcc/Makefile.in
index 7d3c13230e4..163d47c47f1 100644
--- a/gcc/Makefile.in
+++ b/gcc/Makefile.in
@@ -1369,6 +1369,7 @@ OBJS = \
 	init-regs.o \
 	internal-fn.o \
 	ipa-cp.o \
+	ipa-reorder.o \
 	ipa-sra.o \
 	ipa-devirt.o \
 	ipa-fnsummary.o \
diff --git a/gcc/cgraph.c b/gcc/cgraph.c
index 1f7a5c58d98..0e31582d2ce 100644
--- a/gcc/cgraph.c
+++ b/gcc/cgraph.c
@@ -1951,6 +1951,8 @@ cgraph_node::dump (FILE *f)
 }
   if (tp_first_run > 0)
 fprintf (f, " first_run:%i", tp_first_run);
+  if (text_sorted_order > 0)
+fprintf (f, " text_sorted_order:%i", text_sorted_order);
   if (origin)
 fprintf (f, " nested in:%s", origin->asm_name ());
   if (gimple_has_body_p (decl))
@@ -3696,6 +3698,24 @@ cgraph_edge::possibly_call_in_translation_unit_p (void)
   return node->get_availability () >= AVAIL_INTERPOSABLE;
 }
 
+/* Sort cgraph_nodes by text_sorted_order if available, or by order.  */
+
+int
+cgraph_node_cmp_by_text_sorted (const void *pa, const void *pb)
+{
+  const cgraph_node *a = *(const cgraph_node * const *) pa;
+  const cgraph_node *b = *(const cgraph_node * const *) pb;
+
+  /* Functions with text_sorted_order should be before these
+ without profile.  */
+  if (a->text_sorted_order == 0 || b->text_sorted_order == 0)
+return a->text_sorted_order - b->text_sorted_order;
+
+  return a->text_sorted_order != b->text_sorted_order
+	 ? b->text_sorted_order - a->text_sorted_order
+	 : b->order - a->order;
+}
+
 /* A stashed copy of "symtab" for use by selftest::symbol_table_test.
This needs to be a global so that it can be a GC root, and thus
prevent the stashed copy from being garbage-collected if the GC runs
diff --git a/gcc/cgraph.h b/gcc/cgraph.h
index a4f14743f00..ad6ad7ef513 100644
--- a/gcc/cgraph.h
+++ b/gcc/cgraph.h
@@ -1431,6 +1431,8 @@ struct GTY((tag ("SYMTAB_FUNCTION"))) cgraph_node : public symtab_node
   unsigned int profile_id;
   /* Time profiler: first run of function.  */
   int tp_first_run;
+  /* Order in .text.sorted.* section.  */
+  int text_sorted_order;
 
   /* Set when decl is an abstract function pointed to by the
  ABSTRACT_DECL_ORIGIN of a reachable function.  */
@@ -2447,6 +2449,7 @@ bool cgraph_function_possibly_inlined_p (tree);
 
 const char* cgraph_inline_failed_string (cgraph_inline_failed_t);
 cgraph_inline_failed_type_t 

Re: [PATCH][RFC] Add new ipa-reorder pass

2019-10-07 Thread Jan Hubicka
> > This is why we currently have way to order function when outputting them
> > and use that with FDO (using Martin's first execution logic). This has
> > drwarback of making the functions to flow in that order through late
> > optimizations and RTL backend and thus we lose IPA-RA and some
> > IP propagation (late pure/const/nothrow discovery).
> 
> But you can also fix that by the parallelization GSoC project approach,
> decoupling things at RTL expansion IPA-wise and using output order
> for the latter (or even do the fragments in GCC itself by refactoring the
> output machinery to use function-local "strings", assembling the final
> file later).  Refactoring output to handle multiple output "files" at the
> same time might also help getting rid of that early-LTO-debug copying
> stuff (now that it seems that linker-plugin extensions for partially claiming
> files will never happen...)

Multiple output files won't solve my concern about padding and
relaxation, but yep, parallelizing RTL will definitly need to introduce
way to hold final assembly and glue it together, so this problem may get
solved independently which would be nice.

Honza


Re: [PATCH][RFC] Add new ipa-reorder pass

2019-10-07 Thread Richard Biener
On Sun, Oct 6, 2019 at 4:38 PM Jan Hubicka  wrote:
>
> > On 9/19/19 2:33 AM, Martin Liška wrote:
> > > Hi.
> > >
> > > Function reordering has been around for quite some time and a naive
> > > implementation was also part of my diploma thesis some time ago.
> > > Currently, the GCC can reorder function based on first execution, which
> > > happens with PGO and LTO of course. Known limitation is that the order
> > > is preserved only partially as various symbols go into different LTRANS 
> > > partitions.
> > >
> > > There has been some research in the area and I would point out the 
> > > Facebook paper
> > > ([1]) and Sony presentation ([2]). Based on that, I decided to make a new 
> > > implementation
> > > in the GCC that does the same (in a proper way). First part of the 
> > > enablement are patches
> > > to ld.bfd and ld.gold that come up with a new section .text.sorted, that 
> > > is always sorted.
> > > There's a snippet from the modified default linker script:
> > Funny, I was doing this via linker scripts circa ~95, in fact that's why
> > we have -ffunction-sections :-)   We started with scripts which
> > post-processed profiling data to create linker scripts for ELF systems.
> >  We had something for HPUX/SOM as well, but I can't remember what
> > mechanism we used, it may have been a gross level sorting using the SOM
> > section sort key mechanism - it only had 128 or 256 keys with a notable
> > amount of them reserved.
> >
> > We had also built a linker with a basic level of interposition circa
> > 1993 and explored various approaches to reordering executables.  I'd
> > just joined the group at the time and was responsible for wiring up
> > stuff on the OS side, but eventually got the "pleasure" of owning the
> > linker server.  A lot of the C3 algorithmic stuff looks similar to what
> > we did.
>
> For reference I am attaching early LTO version from circa 2010 :)
> >
> > Anyway...
> >
> > I don't see anything objectionable in here.  It's noted as an RFC.  Are
> > you interested in pushing this forward for gcc-10?
>
> I think it is plan to get some form of code layout pass into GCC10.  I
> will test Martin's version on Firefox and see if it have any effect
> here. It is generally quite hard to evaluate those changes (and it is
> reason why I did not moved forward with version form 2010 - more
> precisely it kept falling off the shelf for about a decade)
>
> If LLD has support for cgraph annotations and we support LLD, i think we
> should add support for that, too - it will be very useful to compare
> indvidiual implementations.
> I believe there is gold support (off tree?) for that too and something
> similar is also supported by other toolchains like AIX?
>
> One problem with the sections approach is that every section gets
> aligned to largest code alignment inside of the section. Since our
> alignments are (should be) often cache line based we get cca 30 bytes of
> waste for every function that is quite a lot.
>
> This is why we currently have way to order function when outputting them
> and use that with FDO (using Martin's first execution logic). This has
> drwarback of making the functions to flow in that order through late
> optimizations and RTL backend and thus we lose IPA-RA and some
> IP propagation (late pure/const/nothrow discovery).

But you can also fix that by the parallelization GSoC project approach,
decoupling things at RTL expansion IPA-wise and using output order
for the latter (or even do the fragments in GCC itself by refactoring the
output machinery to use function-local "strings", assembling the final
file later).  Refactoring output to handle multiple output "files" at the
same time might also help getting rid of that early-LTO-debug copying
stuff (now that it seems that linker-plugin extensions for partially claiming
files will never happen...)

> So this approach has a drawback, too. It is why i was trying to push
> myself to get gcc to use gas fragments :)
>
> Anyway, all these issues can be sovled incementally - lets see how
> Maritn's patch work on Firefox and if we can get it tested elsewhere and
> start from that.
>
> I will take a look into the details.
>
> Honza
> >
> > jeff
> >
>
> Index: tree-pass.h
> ===
> *** tree-pass.h (revision 164689)
> --- tree-pass.h (working copy)
> *** extern struct ipa_opt_pass_d pass_ipa_lt
> *** 467,472 
> --- 467,473 
>   extern struct ipa_opt_pass_d pass_ipa_lto_finish_out;
>   extern struct ipa_opt_pass_d pass_ipa_profile;
>   extern struct ipa_opt_pass_d pass_ipa_cdtor_merge;
> + extern struct ipa_opt_pass_d pass_ipa_func_reorder;
>
>   extern struct gimple_opt_pass pass_all_optimizations;
>   extern struct gimple_opt_pass pass_cleanup_cfg_post_optimizing;
> Index: cgraphunit.c
> ===
> *** cgraphunit.c(revision 164689)
> --- cgraphunit.c(working copy)
> 

Re: [PATCH][RFC] Add new ipa-reorder pass

2019-10-06 Thread Jan Hubicka
> On 9/19/19 2:33 AM, Martin Liška wrote:
> > Hi.
> > 
> > Function reordering has been around for quite some time and a naive
> > implementation was also part of my diploma thesis some time ago.
> > Currently, the GCC can reorder function based on first execution, which
> > happens with PGO and LTO of course. Known limitation is that the order
> > is preserved only partially as various symbols go into different LTRANS 
> > partitions.
> > 
> > There has been some research in the area and I would point out the Facebook 
> > paper
> > ([1]) and Sony presentation ([2]). Based on that, I decided to make a new 
> > implementation
> > in the GCC that does the same (in a proper way). First part of the 
> > enablement are patches
> > to ld.bfd and ld.gold that come up with a new section .text.sorted, that is 
> > always sorted.
> > There's a snippet from the modified default linker script:
> Funny, I was doing this via linker scripts circa ~95, in fact that's why
> we have -ffunction-sections :-)   We started with scripts which
> post-processed profiling data to create linker scripts for ELF systems.
>  We had something for HPUX/SOM as well, but I can't remember what
> mechanism we used, it may have been a gross level sorting using the SOM
> section sort key mechanism - it only had 128 or 256 keys with a notable
> amount of them reserved.
> 
> We had also built a linker with a basic level of interposition circa
> 1993 and explored various approaches to reordering executables.  I'd
> just joined the group at the time and was responsible for wiring up
> stuff on the OS side, but eventually got the "pleasure" of owning the
> linker server.  A lot of the C3 algorithmic stuff looks similar to what
> we did.

For reference I am attaching early LTO version from circa 2010 :)
> 
> Anyway...
> 
> I don't see anything objectionable in here.  It's noted as an RFC.  Are
> you interested in pushing this forward for gcc-10?

I think it is plan to get some form of code layout pass into GCC10.  I
will test Martin's version on Firefox and see if it have any effect
here. It is generally quite hard to evaluate those changes (and it is
reason why I did not moved forward with version form 2010 - more
precisely it kept falling off the shelf for about a decade)

If LLD has support for cgraph annotations and we support LLD, i think we
should add support for that, too - it will be very useful to compare
indvidiual implementations.
I believe there is gold support (off tree?) for that too and something
similar is also supported by other toolchains like AIX?

One problem with the sections approach is that every section gets
aligned to largest code alignment inside of the section. Since our
alignments are (should be) often cache line based we get cca 30 bytes of
waste for every function that is quite a lot.

This is why we currently have way to order function when outputting them
and use that with FDO (using Martin's first execution logic). This has
drwarback of making the functions to flow in that order through late
optimizations and RTL backend and thus we lose IPA-RA and some 
IP propagation (late pure/const/nothrow discovery).
So this approach has a drawback, too. It is why i was trying to push
myself to get gcc to use gas fragments :)

Anyway, all these issues can be sovled incementally - lets see how
Maritn's patch work on Firefox and if we can get it tested elsewhere and
start from that.

I will take a look into the details.

Honza
> 
> jeff
> 

Index: tree-pass.h
===
*** tree-pass.h (revision 164689)
--- tree-pass.h (working copy)
*** extern struct ipa_opt_pass_d pass_ipa_lt
*** 467,472 
--- 467,473 
  extern struct ipa_opt_pass_d pass_ipa_lto_finish_out;
  extern struct ipa_opt_pass_d pass_ipa_profile;
  extern struct ipa_opt_pass_d pass_ipa_cdtor_merge;
+ extern struct ipa_opt_pass_d pass_ipa_func_reorder;
  
  extern struct gimple_opt_pass pass_all_optimizations;
  extern struct gimple_opt_pass pass_cleanup_cfg_post_optimizing;
Index: cgraphunit.c
===
*** cgraphunit.c(revision 164689)
--- cgraphunit.c(working copy)
*** cgraph_inline_p (struct cgraph_edge *e,
*** 1517,1522 
--- 1517,1529 
return !e->inline_failed;
  }
  
+ static int
+ node_cmp (const void *pa, const void *pb)
+ {
+   const struct cgraph_node *a = *(const struct cgraph_node * const *) pa;
+   const struct cgraph_node *b = *(const struct cgraph_node * const *) pb;
+   return b->order - a->order;
+ }
  
  
  /* Expand all functions that must be output.
*** cgraph_expand_all_functions (void)
*** 1546,1551 
--- 1553,1561 
  if (order[i]->process)
order[new_order_pos++] = order[i];
  
+   if (flag_ipa_reorder)
+ qsort (order, new_order_pos, sizeof (struct cgraph_node *), node_cmp);
+ 
for (i = new_order_pos - 1; i >= 0; i--)
  {
node 

Re: [PATCH][RFC] Add new ipa-reorder pass

2019-10-04 Thread Jeff Law
On 9/19/19 2:33 AM, Martin Liška wrote:
> Hi.
> 
> Function reordering has been around for quite some time and a naive
> implementation was also part of my diploma thesis some time ago.
> Currently, the GCC can reorder function based on first execution, which
> happens with PGO and LTO of course. Known limitation is that the order
> is preserved only partially as various symbols go into different LTRANS 
> partitions.
> 
> There has been some research in the area and I would point out the Facebook 
> paper
> ([1]) and Sony presentation ([2]). Based on that, I decided to make a new 
> implementation
> in the GCC that does the same (in a proper way). First part of the enablement 
> are patches
> to ld.bfd and ld.gold that come up with a new section .text.sorted, that is 
> always sorted.
> There's a snippet from the modified default linker script:
Funny, I was doing this via linker scripts circa ~95, in fact that's why
we have -ffunction-sections :-)   We started with scripts which
post-processed profiling data to create linker scripts for ELF systems.
 We had something for HPUX/SOM as well, but I can't remember what
mechanism we used, it may have been a gross level sorting using the SOM
section sort key mechanism - it only had 128 or 256 keys with a notable
amount of them reserved.

We had also built a linker with a basic level of interposition circa
1993 and explored various approaches to reordering executables.  I'd
just joined the group at the time and was responsible for wiring up
stuff on the OS side, but eventually got the "pleasure" of owning the
linker server.  A lot of the C3 algorithmic stuff looks similar to what
we did.

Anyway...

I don't see anything objectionable in here.  It's noted as an RFC.  Are
you interested in pushing this forward for gcc-10?

jeff



Re: [PATCH][RFC] Add new ipa-reorder pass

2019-10-03 Thread Andrew Pinski
On Thu, Oct 3, 2019 at 12:46 AM Fangrui Song  wrote:
>
>
> On 2019-10-03, Andrew Pinski wrote:
> >On Wed, Oct 2, 2019@9:52 PM Fangrui Song  wrote:
> >>
> >> On 2019-09-24, Martin Liška wrote:
> >> >On 9/19/19 10:33 AM, Martin Liška wrote:
> >> >> - One needs modified binutils and I that would probably require a 
> >> >> configure detection. The only way
> >> >>   which I see is based on ld --version. I'm planning to make the 
> >> >> binutils submission soon.
> >> >
> >> >The patch submission link:
> >> >https://sourceware.org/ml/binutils/2019-09/msg00219.html
> >>
> >> Hi Martin,
> >>
> >> I have a question about why .text.sorted.* are needed.
> >>
> >> The Sony presentation (your [2] link) embedded a new section
> >> .llvm.call-graph-profile[3] to represent edges in the object files. The
> >> linker (lld) collects all .llvm.call-graph-profile sections and does a
> >> C3 layout. There is no need for new section type .text.sorted.*
> >>
> >> [3]: 
> >> https://github.com/llvm/llvm-project/blob/master/lld/test/ELF/cgprofile-obj.s
> >>
> >> (Please CC me. I am not subscribed.)
> >
> >The idea of GCC's version is to that the modification needed to the
> >linker is very little.  And even without patching the linker
> >script/linker, you don't need to much special and you don't need LD to
> >do much work@all.
>
> I am afraid this can be a large limitation. Then .text.sorted.* can
> only a) reorder functions within a translation unit, b) or reorder all
> functions when LTO is enabled.

I don't think it is that limited.  In fact I think the other way
around is more limited as you are now limited on changing the call
graph data format.  This reduces the complexity of the linker really.
Linkers are already slow; why make them even slower?  This patch is
you only need to change the linker once and you don't have
dependencies between the linker and compiler.

Thanks,
Andrew

>
> b) is possible only if all .text.sorted.* sections can be numbered.
>
> For the LLVM case, call graph profile can be used without LTO. When both
> ThinLTO+PGO are enabled, however, the additional performance improvement
> offered by the global call graph profile reordering is insignificant,
> smaller than 1%.See you said it was less than 1% so why do it then?

Thanks,
Andrew Pinski


Re: [PATCH][RFC] Add new ipa-reorder pass

2019-10-03 Thread Fangrui Song



On 2019-10-03, Andrew Pinski wrote:

On Wed, Oct 2, 2019@9:52 PM Fangrui Song  wrote:


On 2019-09-24, Martin Liška wrote:
>On 9/19/19 10:33 AM, Martin Liška wrote:
>> - One needs modified binutils and I that would probably require a configure 
detection. The only way
>>   which I see is based on ld --version. I'm planning to make the binutils 
submission soon.
>
>The patch submission link:
>https://sourceware.org/ml/binutils/2019-09/msg00219.html

Hi Martin,

I have a question about why .text.sorted.* are needed.

The Sony presentation (your [2] link) embedded a new section
.llvm.call-graph-profile[3] to represent edges in the object files. The
linker (lld) collects all .llvm.call-graph-profile sections and does a
C3 layout. There is no need for new section type .text.sorted.*

[3]: 
https://github.com/llvm/llvm-project/blob/master/lld/test/ELF/cgprofile-obj.s

(Please CC me. I am not subscribed.)


The idea of GCC's version is to that the modification needed to the
linker is very little.  And even without patching the linker
script/linker, you don't need to much special and you don't need LD to
do much work@all.


I am afraid this can be a large limitation. Then .text.sorted.* can
only a) reorder functions within a translation unit, b) or reorder all
functions when LTO is enabled.

b) is possible only if all .text.sorted.* sections can be numbered.

For the LLVM case, call graph profile can be used without LTO. When both
ThinLTO+PGO are enabled, however, the additional performance improvement
offered by the global call graph profile reordering is insignificant,
smaller than 1%.


Re: [PATCH][RFC] Add new ipa-reorder pass

2019-10-03 Thread Andrew Pinski
On Wed, Oct 2, 2019 at 9:52 PM Fangrui Song  wrote:
>
> On 2019-09-24, Martin Liška wrote:
> >On 9/19/19 10:33 AM, Martin Liška wrote:
> >> - One needs modified binutils and I that would probably require a 
> >> configure detection. The only way
> >>   which I see is based on ld --version. I'm planning to make the binutils 
> >> submission soon.
> >
> >The patch submission link:
> >https://sourceware.org/ml/binutils/2019-09/msg00219.html
>
> Hi Martin,
>
> I have a question about why .text.sorted.* are needed.
>
> The Sony presentation (your [2] link) embedded a new section
> .llvm.call-graph-profile[3] to represent edges in the object files. The
> linker (lld) collects all .llvm.call-graph-profile sections and does a
> C3 layout. There is no need for new section type .text.sorted.*
>
> [3]: 
> https://github.com/llvm/llvm-project/blob/master/lld/test/ELF/cgprofile-obj.s
>
> (Please CC me. I am not subscribed.)

The idea of GCC's version is to that the modification needed to the
linker is very little.  And even without patching the linker
script/linker, you don't need to much special and you don't need LD to
do much work at all.

Thanks,
Andrew


Re: [PATCH][RFC] Add new ipa-reorder pass

2019-10-02 Thread Fangrui Song

On 2019-09-24, Martin Liška wrote:

On 9/19/19 10:33 AM, Martin Liška wrote:

- One needs modified binutils and I that would probably require a configure 
detection. The only way
  which I see is based on ld --version. I'm planning to make the binutils 
submission soon.


The patch submission link:
https://sourceware.org/ml/binutils/2019-09/msg00219.html


Hi Martin,

I have a question about why .text.sorted.* are needed.

The Sony presentation (your [2] link) embedded a new section
.llvm.call-graph-profile[3] to represent edges in the object files. The
linker (lld) collects all .llvm.call-graph-profile sections and does a
C3 layout. There is no need for new section type .text.sorted.*

[3]: 
https://github.com/llvm/llvm-project/blob/master/lld/test/ELF/cgprofile-obj.s

(Please CC me. I am not subscribed.)


Re: [PATCH][RFC] Add new ipa-reorder pass

2019-09-26 Thread Martin Liška
On 9/25/19 6:36 PM, Evgeny Kudryashov wrote:
> On 2019-09-19 11:33, Martin Liška wrote:
>> Hi.
>>
>> Function reordering has been around for quite some time and a naive
>> implementation was also part of my diploma thesis some time ago.
>> Currently, the GCC can reorder function based on first execution, which
>> happens with PGO and LTO of course. Known limitation is that the order
>> is preserved only partially as various symbols go into different
>> LTRANS partitions.
>>
>> There has been some research in the area and I would point out the
>> Facebook paper
>> ([1]) and Sony presentation ([2]). Based on that, I decided to make a
>> new implementation
>> in the GCC that does the same (in a proper way). First part of the
>> enablement are patches
>> to ld.bfd and ld.gold that come up with a new section .text.sorted,
>> that is always sorted.
>>
>> Thoughts? I would definitely welcome any interesting measurement on a
>> bigger load.
>>
>> Martin
>>
> 
> Hi, Martin!
> 
> Some time ago I tried to do the same but didn't go that far.

Hello.

> 
> I also used the C3 algorithm, except for the fact that ipa_fn_summary 
> contains information about size and time (somehow missed it).

Which is a key part of the algorithm as one wants not to cross a page size 
boundary (in ideal case).


> The linker option --sort-section=name was used for prototyping. It, 
> obviously, sorts sections and allows to place functions in the desired order 
> (by placing them into named sections .text.sorted., without patching 
> linkers or adjusting linker scripts).

Yes, that can work, but having the new section explicitly sorted will not 
require a passing of an option to linker. 

> For testing my implementation I used several benchmarks from SPEC2006: 
> perlbench, sjeng and gobmk. Unfortunately, no significant positive changes 
> were obtained.
> 
> I've tested the proposed pass on perlbench with train and reference input 
> (PGO+LTO as a base) but couldn't obtain stable results (still unclear 
> environment or perlbench specific reasons).

Yes, it's quite difficult to measure something. One needs a huge .text section 
of a binary and a test-case which has a flat profile.
I was able to measure results on the gcc itself.

Martin

> 
> Evgeny.
> 



Re: [PATCH][RFC] Add new ipa-reorder pass

2019-09-25 Thread Evgeny Kudryashov

On 2019-09-19 11:33, Martin Liška wrote:

Hi.

Function reordering has been around for quite some time and a naive
implementation was also part of my diploma thesis some time ago.
Currently, the GCC can reorder function based on first execution, which
happens with PGO and LTO of course. Known limitation is that the order
is preserved only partially as various symbols go into different
LTRANS partitions.

There has been some research in the area and I would point out the
Facebook paper
([1]) and Sony presentation ([2]). Based on that, I decided to make a
new implementation
in the GCC that does the same (in a proper way). First part of the
enablement are patches
to ld.bfd and ld.gold that come up with a new section .text.sorted,
that is always sorted.

Thoughts? I would definitely welcome any interesting measurement on a
bigger load.

Martin



Hi, Martin!

Some time ago I tried to do the same but didn't go that far.

I also used the C3 algorithm, except for the fact that ipa_fn_summary 
contains information about size and time (somehow missed it). The linker 
option --sort-section=name was used for prototyping. It, obviously, 
sorts sections and allows to place functions in the desired order (by 
placing them into named sections .text.sorted., without patching 
linkers or adjusting linker scripts). For testing my implementation I 
used several benchmarks from SPEC2006: perlbench, sjeng and gobmk. 
Unfortunately, no significant positive changes were obtained.


I've tested the proposed pass on perlbench with train and reference 
input (PGO+LTO as a base) but couldn't obtain stable results (still 
unclear environment or perlbench specific reasons).


Evgeny.



Re: [PATCH][RFC] Add new ipa-reorder pass

2019-09-24 Thread Martin Liška
On 9/19/19 10:33 AM, Martin Liška wrote:
> - One needs modified binutils and I that would probably require a configure 
> detection. The only way
>   which I see is based on ld --version. I'm planning to make the binutils 
> submission soon.

The patch submission link:
https://sourceware.org/ml/binutils/2019-09/msg00219.html