Re: [PATCH AutoFDO]Restoring indirect call value profile transformation
On Tue, Jan 15, 2019 at 1:10 AM Andi Kleen wrote:
> On Mon, Jan 14, 2019 at 04:15:20PM +0800, Bin.Cheng wrote:
> > On Mon, Jan 14, 2019 at 4:07 PM Andi Kleen wrote:
> > >
> > > Bin Cheng,
> > >
> > > I did some testing on this now.  The attached patch automatically
> > > increases the iterations for autofdo profiles.
> > Hi Andi, thanks very much for tuning these.
> > >
> > > But even with even more iterations I still have stable failures in
> > >
> > > FAIL: gcc.dg/tree-prof/cold_partition_label.c scan-assembler foo[._]+cold
> > > FAIL: gcc.dg/tree-prof/cold_partition_label.c scan-assembler size[ \ta-zA-Z0-0]+foo[._]+cold
> > I think these two are supposed to fail with the current code base.
> > Sorry for being late.
> We should mark it as XFAIL then I guess.

Yeah.

> Is it understood why it doesn't work?

I didn't dig into it, but I think the reason is that autofdo::zero count is
not considered as cold as a profile count.  It's kind of reasonable,
otherwise too much code would be categorized as cold.

> > > FAIL: gcc.dg/tree-prof/indir-call-prof.c scan-ipa-dump afdo "Indirect call -> direct call.* a1 transformation on insn"
> > I also got unstable pass/fail for indirect call optimization when tuning
> > iterations, and haven't got an iteration number which passes all the
> > time.  I guess we need to combine decreasing of sampling count here.
> Okay I will look into that.
>
> Could also try if prime sample after values help, this sometimes fixes
> problems with systematically missing some code in sampling.

> > > FAIL: gcc.dg/tree-prof/peel-1.c scan-tree-dump cunroll "Peeled loop ., 1 times"
> > This one should fail too.
> Same.

Maybe I shouldn't use the words "should fail"; the reason is that we don't
consider autofdo profile counts as reliable in niters analysis.  Maybe this
can be improved somehow.

Thanks,
bin

> -Andi
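Marking the expected failures explicitly would keep the testsuite quiet. As a sketch only (the exact selector for distinguishing the autofdo run from the instrumented tree-prof run would need checking), the dejagnu convention for an expected scan failure looks like:

```c
/* Hypothetical dg-final directives for cold_partition_label.c: the
   scan-assembler checks are expected to fail under autofdo because the
   autofdo zero count is not treated as cold the way a profile count is.  */
/* { dg-final { scan-assembler "foo\[._\]+cold" { xfail *-*-* } } } */
/* { dg-final { scan-assembler "size\[ \ta-zA-Z0-0\]+foo\[._\]+cold" { xfail *-*-* } } } */
```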
Re: [PATCH AutoFDO]Restoring indirect call value profile transformation
On Mon, Jan 14, 2019 at 04:15:20PM +0800, Bin.Cheng wrote:
> On Mon, Jan 14, 2019 at 4:07 PM Andi Kleen wrote:
> >
> > Bin Cheng,
> >
> > I did some testing on this now.  The attached patch automatically
> > increases the iterations for autofdo profiles.
> Hi Andi, thanks very much for tuning these.
> >
> > But even with even more iterations I still have stable failures in
> >
> > FAIL: gcc.dg/tree-prof/cold_partition_label.c scan-assembler foo[._]+cold
> > FAIL: gcc.dg/tree-prof/cold_partition_label.c scan-assembler size[ \ta-zA-Z0-0]+foo[._]+cold
> I think these two are supposed to fail with current code base.

We should mark it as XFAIL then I guess.

Is it understood why it doesn't work?

> > FAIL: gcc.dg/tree-prof/indir-call-prof.c scan-ipa-dump afdo "Indirect call -> direct call.* a1 transformation on insn"
> I also got unstable pass/fail for indirect call optimization when tuning
> iterations, and haven't got an iteration number which passes all the
> time.  I guess we need to combine decreasing of sampling count here.

Okay I will look into that.

Could also try if prime sample after values help, this sometimes fixes
problems with systematically missing some code in sampling.

> > FAIL: gcc.dg/tree-prof/peel-1.c scan-tree-dump cunroll "Peeled loop ., 1 times"
> This one should fail too.

Same.

-Andi
Re: [PATCH AutoFDO]Restoring indirect call value profile transformation
On Mon, Jan 14, 2019 at 4:07 PM Andi Kleen wrote:
>
> Bin Cheng,
>
> I did some testing on this now.  The attached patch automatically
> increases the iterations for autofdo profiles.
Hi Andi, thanks very much for tuning these.
>
> But even with even more iterations I still have stable failures in
>
> FAIL: gcc.dg/tree-prof/cold_partition_label.c scan-assembler foo[._]+cold
> FAIL: gcc.dg/tree-prof/cold_partition_label.c scan-assembler size[ \ta-zA-Z0-0]+foo[._]+cold
I think these two are supposed to fail with current code base.
> FAIL: gcc.dg/tree-prof/indir-call-prof.c scan-ipa-dump afdo "Indirect call -> direct call.* a1 transformation on insn"
I also got unstable pass/fail for indirect call optimization when tuning
iterations, and haven't got an iteration number which passes all the
time.  I guess we need to combine decreasing of sampling count here.
> FAIL: gcc.dg/tree-prof/peel-1.c scan-tree-dump cunroll "Peeled loop ., 1 times"
This one should fail too.

Thanks,
bin

> Did these really ever work for you?
>
> -Andi
>
> diff --git a/gcc/testsuite/g++.dg/tree-prof/morefunc.C b/gcc/testsuite/g++.dg/tree-prof/morefunc.C
> index a9bdc167f45..02b01c073e9 100644
> --- a/gcc/testsuite/g++.dg/tree-prof/morefunc.C
> +++ b/gcc/testsuite/g++.dg/tree-prof/morefunc.C
> @@ -2,6 +2,10 @@
>  #include "reorder_class1.h"
>  #include "reorder_class2.h"
>
> +#ifndef ITER
> +#define ITER 1000
> +#endif
> +
>  int g;
>
>  #ifdef _PROFILE_USE
> @@ -19,7 +23,7 @@ static __attribute__((always_inline))
>  void test1 (A *tc)
>  {
>    int i;
> -  for (i = 0; i < 1000; i++)
> +  for (i = 0; i < ITER; i++)
>      g += tc->foo();
>    if (g<100) g++;
>  }
> @@ -28,7 +32,7 @@ static __attribute__((always_inline))
>  void test2 (B *tc)
>  {
>    int i;
> -  for (i = 0; i < 100; i++)
> +  for (i = 0; i < ITER; i++)
>      g += tc->foo();
>  }
>
> diff --git a/gcc/testsuite/gcc.dg/tree-prof/cold_partition_label.c b/gcc/testsuite/gcc.dg/tree-prof/cold_partition_label.c
> index 450308d6407..099069da6a7 100644
> --- a/gcc/testsuite/gcc.dg/tree-prof/cold_partition_label.c
> +++ b/gcc/testsuite/gcc.dg/tree-prof/cold_partition_label.c
> @@ -9,6 +9,10 @@ const char *sarr[SIZE];
>  const char *buf_hot;
>  const char *buf_cold;
>
> +#ifndef ITER
> +#define ITER 100
> +#endif
> +
>  __attribute__((noinline))
>  void
>  foo (int path)
> @@ -32,7 +36,7 @@ main (int argc, char *argv[])
>    int i;
>    buf_hot = "hello";
>    buf_cold = "world";
> -  for (i = 0; i < 100; i++)
> +  for (i = 0; i < ITER; i++)
>      foo (argc);
>    return 0;
>  }
> diff --git a/gcc/testsuite/gcc.dg/tree-prof/crossmodule-indircall-1.c b/gcc/testsuite/gcc.dg/tree-prof/crossmodule-indircall-1.c
> index 58109d54dc7..32d22c69c6c 100644
> --- a/gcc/testsuite/gcc.dg/tree-prof/crossmodule-indircall-1.c
> +++ b/gcc/testsuite/gcc.dg/tree-prof/crossmodule-indircall-1.c
> @@ -2,6 +2,10 @@
>  /* { dg-additional-sources "crossmodule-indircall-1a.c" } */
>  /* { dg-options "-O3 -flto -DDOJOB=1" } */
>
> +#ifndef ITER
> +#define ITER 1000
> +#endif
> +
>  int a;
>  extern void (*p[2])(int n);
>  void abort (void);
> @@ -10,12 +14,12 @@ main()
>  { int i;
>
>    /* This call shall be converted.  */
> -  for (i = 0;i<1000;i++)
> +  for (i = 0;i<ITER;i++)
>      p[0](1);
>    /* This call shall not be converted.  */
> -  for (i = 0;i<1000;i++)
> +  for (i = 0;i<ITER;i++)
>      p[i%2](2);
> -  if (a != 1000)
> +  if (a != ITER)
>      abort ();
>
>    return 0;
> diff --git a/gcc/testsuite/gcc.dg/tree-prof/indir-call-prof.c b/gcc/testsuite/gcc.dg/tree-prof/indir-call-prof.c
> index 53063c3e7fa..8b9dfbb78c7 100644
> --- a/gcc/testsuite/gcc.dg/tree-prof/indir-call-prof.c
> +++ b/gcc/testsuite/gcc.dg/tree-prof/indir-call-prof.c
> @@ -1,5 +1,9 @@
>  /* { dg-options "-O2 -fdump-tree-optimized -fdump-ipa-profile -fdump-ipa-afdo" } */
>
> +#ifndef ITER
> +#define ITER 10
> +#endif
> +
>  static int a1 (void)
>  {
>    return 10;
> @@ -28,7 +32,7 @@ main (void)
>    int (*p) (void);
>    int i;
>
> -  for (i = 0; i < 1000; i ++)
> +  for (i = 0; i < ITER*100; i++)
>      {
>        setp (&p, i);
>        p ();
> diff --git a/gcc/testsuite/gcc.dg/tree-prof/peel-1.c b/gcc/testsuite/gcc.dg/tree-prof/peel-1.c
> index 7245b68c1ee..b6ed178e1ad 100644
> --- a/gcc/testsuite/gcc.dg/tree-prof/peel-1.c
> +++ b/gcc/testsuite/gcc.dg/tree-prof/peel-1.c
> @@ -1,13 +1,17 @@
>  /* { dg-options "-O3 -fdump-tree-cunroll-details -fno-unroll-loops -fpeel-loops" } */
>  void abort();
>
> -int a[1000];
> +#ifndef ITER
> +#define ITER 1000
> +#endif
> +
> +int a[ITER];
>  int
>  __attribute__ ((noinline))
>  t()
>  {
>    int i;
> -  for (i=0;i<1000;i++)
> +  for (i=0;i<ITER;i++)
>      if (!a[i])
>        return 1;
>    abort ();
> @@ -16,7 +20,7 @@ int
>  main()
>  {
>    int i;
> -  for (i=0;i<1000;i++)
> +  for (i=0;i<ITER;i++)
>      t();
>    return 0;
>  }
> diff --git a/gcc/testsuite/gcc.dg/tree-prof/pr52027.c b/gcc/testsuite/gcc.dg/tree-prof/pr52027.c
> index c46a14b2e86..bf2a83a336d
Re: [PATCH AutoFDO]Restoring indirect call value profile transformation
Bin Cheng,

I did some testing on this now.  The attached patch automatically
increases the iterations for autofdo profiles.

But even with even more iterations I still have stable failures in

FAIL: gcc.dg/tree-prof/cold_partition_label.c scan-assembler foo[._]+cold
FAIL: gcc.dg/tree-prof/cold_partition_label.c scan-assembler size[ \ta-zA-Z0-0]+foo[._]+cold
FAIL: gcc.dg/tree-prof/indir-call-prof.c scan-ipa-dump afdo "Indirect call -> direct call.* a1 transformation on insn"
FAIL: gcc.dg/tree-prof/peel-1.c scan-tree-dump cunroll "Peeled loop ., 1 times"

Did these really ever work for you?

-Andi

diff --git a/gcc/testsuite/g++.dg/tree-prof/morefunc.C b/gcc/testsuite/g++.dg/tree-prof/morefunc.C
index a9bdc167f45..02b01c073e9 100644
--- a/gcc/testsuite/g++.dg/tree-prof/morefunc.C
+++ b/gcc/testsuite/g++.dg/tree-prof/morefunc.C
@@ -2,6 +2,10 @@
 #include "reorder_class1.h"
 #include "reorder_class2.h"

+#ifndef ITER
+#define ITER 1000
+#endif
+
 int g;

 #ifdef _PROFILE_USE
@@ -19,7 +23,7 @@ static __attribute__((always_inline))
 void test1 (A *tc)
 {
   int i;
-  for (i = 0; i < 1000; i++)
+  for (i = 0; i < ITER; i++)
     g += tc->foo();
   if (g<100) g++;
 }
@@ -28,7 +32,7 @@ static __attribute__((always_inline))
 void test2 (B *tc)
 {
   int i;
-  for (i = 0; i < 100; i++)
+  for (i = 0; i < ITER; i++)
     g += tc->foo();
 }

diff --git a/gcc/testsuite/gcc.dg/tree-prof/cold_partition_label.c b/gcc/testsuite/gcc.dg/tree-prof/cold_partition_label.c
index 450308d6407..099069da6a7 100644
--- a/gcc/testsuite/gcc.dg/tree-prof/cold_partition_label.c
+++ b/gcc/testsuite/gcc.dg/tree-prof/cold_partition_label.c
@@ -9,6 +9,10 @@ const char *sarr[SIZE];
 const char *buf_hot;
 const char *buf_cold;

+#ifndef ITER
+#define ITER 100
+#endif
+
 __attribute__((noinline))
 void
 foo (int path)
@@ -32,7 +36,7 @@ main (int argc, char *argv[])
   int i;
   buf_hot = "hello";
   buf_cold = "world";
-  for (i = 0; i < 100; i++)
+  for (i = 0; i < ITER; i++)
     foo (argc);
   return 0;
 }
diff --git a/gcc/testsuite/gcc.dg/tree-prof/crossmodule-indircall-1.c b/gcc/testsuite/gcc.dg/tree-prof/crossmodule-indircall-1.c
index 58109d54dc7..32d22c69c6c 100644
--- a/gcc/testsuite/gcc.dg/tree-prof/crossmodule-indircall-1.c
+++ b/gcc/testsuite/gcc.dg/tree-prof/crossmodule-indircall-1.c
@@ -2,6 +2,10 @@
 /* { dg-additional-sources "crossmodule-indircall-1a.c" } */
 /* { dg-options "-O3 -flto -DDOJOB=1" } */

+#ifndef ITER
+#define ITER 1000
+#endif
+
 int a;
 extern void (*p[2])(int n);
 void abort (void);
@@ -10,12 +14,12 @@ main()
 { int i;

   /* This call shall be converted. */
-  for (i = 0;i<1000;i++)
+  for (i = 0;i<ITER;i++)
Re: [PATCH AutoFDO]Restoring indirect call value profile transformation
On Tue, 18 Dec 2018, Andi Kleen wrote:
> > Yes, take g++.dg/tree-prof/morefunc.C as an example:
> > -  int i;
> > -  for (i = 0; i < 1000; i++)
> > +  int i, j;
> > +  for (i = 0; i < 100; i++)
> > +    for (j = 0; j < 50; j++)
> >      g += tc->foo();
> >    if (g<100) g++;
> >  }
> > @@ -27,8 +28,9 @@ void test1 (A *tc)
> >  static __attribute__((always_inline))
> >  void test2 (B *tc)
> >  {
> > -  int i;
> > +  int i, j;
> >    for (i = 0; i < 100; i++)
> > +    for (j = 0; j < 50; j++)
> >
> > I have to increase loop count like this to get stable pass on my
> > machine.  The original count (1000) is too small to be sampled.
>
> IIRC It was originally higher, but people running on slow simulators
> complained, so it was reduced.  Perhaps we need some way to detect in the
> test suite that the test runs on a real CPU.

Doesn't check_effective_target_simulator work here?  See e.g.
libstdc++-v3/testsuite/25_algorithms/heap/moveable2.cc for an example.

brgds, H-P
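Concretely, a test that only makes sense with enough real samples could be gated on the `simulator` effective target that H-P mentions. This is a sketch of the dejagnu convention only, not a tested change to any particular test:

```c
/* Hypothetical header for an autofdo tree-prof test: skip the run on
   simulators, where the iteration counts needed for stable sampling
   would make the test far too slow.  */
/* { dg-skip-if "iteration count too high for simulators" { simulator } } */
/* { dg-options "-O2 -fdump-ipa-afdo" } */
```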
Re: [PATCH AutoFDO]Restoring indirect call value profile transformation
On Wed, Dec 19, 2018 at 06:28:29PM +0100, Richard Biener wrote:
> On Wed, Dec 19, 2018 at 4:41 PM Andi Kleen wrote:
> > > > We can combine the two together, increasing iteration count and
> > > > decreasing perf count at the same time.  What count would you
> > > > suggest from your experience?
> > >
> > > Can we instead for the tests where we want to test profile use/merge
> > > elide the profiling step and supply the "raw" data in a testsuite
> > > alternate file instead?
> >
> > That would be possible, but a drawback is that we wouldn't have an
> > "end2end" test anymore that also tests the interaction with perf and
> > autofdo.  Would be good to test these cases too, there were
> > regressions in this before.
>
> Sure.
>
> > But perhaps splitting that into two separate tests is reasonable,
> > with the majority of tests running with fake data.
> >
> > This would have the advantage that gcc developers who don't have an
> > autofdo setup (e.g. missing tools or running in virtualization with
> > the PMU disabled) would still do most of the regression tests.
>
> Yes, I think the pros outweigh the cons here.  Well, at least if
> generating such data that works on multiple archs is even possible?

The gcov data that comes out of autofdo is architecture independent as far
as I know.  It's mainly counts per line.  In fact even the perf input data
should be fairly architecture independent (except perhaps for endianness).

I think it would need a way to write gcov data from text input (unless you
want to put a lot of binaries into the repository).  Also it would need to
be adjusted every time a line number changes in the test cases.  I guess
best would be if dejagnu could somehow generate it from test case comments,
but I don't know how complicated that would be.  Doing such updates would
likely be difficult with binaries.

In the future, if we ever re-add discriminator support, it would also need
some way to specify the correct discriminator.  I guess for simple test
cases it could be ensured it is always 0.

-Andi
Re: [PATCH AutoFDO]Restoring indirect call value profile transformation
On Wed, Dec 19, 2018 at 4:41 PM Andi Kleen wrote:
> > > We can combine the two together, increasing iteration count and
> > > decreasing perf count at the same time.  What count would you
> > > suggest from your experience?
> >
> > Can we instead for the tests where we want to test profile use/merge
> > elide the profiling step and supply the "raw" data in a testsuite
> > alternate file instead?
>
> That would be possible, but a drawback is that we wouldn't have an
> "end2end" test anymore that also tests the interaction with perf and
> autofdo.  Would be good to test these cases too, there were regressions
> in this before.

Sure.

> But perhaps splitting that into two separate tests is reasonable,
> with the majority of tests running with fake data.
>
> This would have the advantage that gcc developers who don't have an
> autofdo setup (e.g. missing tools or running in virtualization with the
> PMU disabled) would still do most of the regression tests.

Yes, I think the pros outweigh the cons here.  Well, at least if
generating such data that works on multiple archs is even possible?

Richard.

> -Andi
Re: [PATCH AutoFDO]Restoring indirect call value profile transformation
On Wed, Dec 19, 2018 at 12:08:35PM +0800, Bin.Cheng wrote:
> On Wed, Dec 19, 2018 at 12:00 PM Andi Kleen wrote:
> > On Wed, Dec 19, 2018 at 10:01:15AM +0800, Bin.Cheng wrote:
> > > On Tue, Dec 18, 2018 at 7:15 PM Bin.Cheng wrote:
> > > > On Sun, Dec 16, 2018 at 9:11 AM Andi Kleen wrote:
> > > > > "bin.cheng" writes:
> > > > > > Hi,
> > > > > >
> > > > > > Due to ICE and mal-functional bugs, indirect call value profile
> > > > > > transformation is disabled on GCC-7/8/trunk.  This patch
> > > > > > restores the transformation.  The main issue is AutoFDO should
> > > > > > store cgraph_node's profile_id of the callee func in the first
> > > > > > histogram value's counter, rather than a pointer to the
> > > > > > callee's name string as it is now.
> > > > > > With the patch, some "Indirect call -> direct call" tests pass
> > > > > > with autofdo, while others are unstable.  I think the
> > > > > > instability is caused by poor perf data collected during the
> > > > > > regtest run, and can confirm these tests pass if good perf
> > > > > > data could be collected in manual experiments.
> > > > >
> > > > > Would be good to make the tests stable, otherwise we'll just
> > > > > have regressions in the future again.
> > > > >
> > > > > The problem is that the tests don't run long enough and don't
> > > > > get enough samples?
> > > > Yes, take g++.dg/tree-prof/morefunc.C as an example:
> > > > -  int i;
> > > > -  for (i = 0; i < 1000; i++)
> > > > +  int i, j;
> > > > +  for (i = 0; i < 100; i++)
> > > > +    for (j = 0; j < 50; j++)
> > > >      g += tc->foo();
> > > >    if (g<100) g++;
> > > >  }
> > > > @@ -27,8 +28,9 @@ void test1 (A *tc)
> > > >  static __attribute__((always_inline))
> > > >  void test2 (B *tc)
> > > >  {
> > > > -  int i;
> > > > +  int i, j;
> > > >    for (i = 0; i < 100; i++)
> > > > +    for (j = 0; j < 50; j++)
> > > >
> > > > I have to increase loop count like this to get stable pass on my
> > > > machine.  The original count (1000) is too small to be sampled.
> > > > >
> > > > > Could add some loop?
> > > > > Or possibly increase the sampling frequency in perf (-F or -c)?
> > > > Maybe, I will have a try.
> > > Turned out all "Indirect call" tests can be resolved by adding
> > > -c 100 to the perf command line:
> > > diff --git a/gcc/config/i386/gcc-auto-profile b/gcc/config/i386/gcc-auto-profile
> > > ...
> > > -exec perf record -e $E -b "$@"
> > > +exec perf record -e $E -c 100 -b "$@"
> > >
> > > Is 100 too small here?  Or is it fine for all scenarios?
> >
> > -c 100 is risky because it can cause perf throttling, which
> > makes it lose data.
> Right, it looks suspicious to me too.
>
> > perf has a limiter that if the PMU handler uses too much CPU
> > time it stops measuring for some time.  A PMI is 10k+ cycles,
> > so doing one every 100 branches is a lot of CPU time.
> >
> > I wouldn't go down that low.  It is better to increase the
> > iteration count.
> We can combine the two together, increasing iteration count and
> decreasing perf count at the same time.  What count would you suggest
> from your experience?

Normally nothing less than 50k for a common event like branches.  But for
such a limited test 10k might still work, as long as the runtime is fairly
controlled.  We would probably need to ensure at least 10+ samples, so that
would be 100k iterations.  IIRC that was what we used originally, until
people complained about the simulator run times.

-Andi
Re: [PATCH AutoFDO]Restoring indirect call value profile transformation
> > We can combine the two together, increasing iteration count and
> > decreasing perf count at the same time.  What count would you suggest
> > from your experience?
>
> Can we instead for the tests where we want to test profile use/merge
> elide the profiling step and supply the "raw" data in a testsuite
> alternate file instead?

That would be possible, but a drawback is that we wouldn't have an
"end2end" test anymore that also tests the interaction with perf and
autofdo.  Would be good to test these cases too, there were regressions
in this before.

But perhaps splitting that into two separate tests is reasonable,
with the majority of tests running with fake data.

This would have the advantage that gcc developers who don't have an
autofdo setup (e.g. missing tools or running in virtualization with the
PMU disabled) would still do most of the regression tests.

-Andi
Re: [PATCH AutoFDO]Restoring indirect call value profile transformation
On Wed, Dec 19, 2018 at 5:08 AM Bin.Cheng wrote:
> On Wed, Dec 19, 2018 at 12:00 PM Andi Kleen wrote:
> > On Wed, Dec 19, 2018 at 10:01:15AM +0800, Bin.Cheng wrote:
> > > On Tue, Dec 18, 2018 at 7:15 PM Bin.Cheng wrote:
> > > > On Sun, Dec 16, 2018 at 9:11 AM Andi Kleen wrote:
> > > > > "bin.cheng" writes:
> > > > > > Hi,
> > > > > >
> > > > > > Due to ICE and mal-functional bugs, indirect call value profile
> > > > > > transformation is disabled on GCC-7/8/trunk.  This patch
> > > > > > restores the transformation.  The main issue is AutoFDO should
> > > > > > store cgraph_node's profile_id of the callee func in the first
> > > > > > histogram value's counter, rather than a pointer to the
> > > > > > callee's name string as it is now.
> > > > > > With the patch, some "Indirect call -> direct call" tests pass
> > > > > > with autofdo, while others are unstable.  I think the
> > > > > > instability is caused by poor perf data collected during the
> > > > > > regtest run, and can confirm these tests pass if good perf
> > > > > > data could be collected in manual experiments.
> > > > >
> > > > > Would be good to make the tests stable, otherwise we'll just
> > > > > have regressions in the future again.
> > > > >
> > > > > The problem is that the tests don't run long enough and don't
> > > > > get enough samples?
> > > > Yes, take g++.dg/tree-prof/morefunc.C as an example:
> > > > -  int i;
> > > > -  for (i = 0; i < 1000; i++)
> > > > +  int i, j;
> > > > +  for (i = 0; i < 100; i++)
> > > > +    for (j = 0; j < 50; j++)
> > > >      g += tc->foo();
> > > >    if (g<100) g++;
> > > >  }
> > > > @@ -27,8 +28,9 @@ void test1 (A *tc)
> > > >  static __attribute__((always_inline))
> > > >  void test2 (B *tc)
> > > >  {
> > > > -  int i;
> > > > +  int i, j;
> > > >    for (i = 0; i < 100; i++)
> > > > +    for (j = 0; j < 50; j++)
> > > >
> > > > I have to increase loop count like this to get stable pass on my
> > > > machine.  The original count (1000) is too small to be sampled.
> > > > >
> > > > > Could add some loop?
> > > > > Or possibly increase the sampling frequency in perf (-F or -c)?
> > > > Maybe, I will have a try.
> > > Turned out all "Indirect call" tests can be resolved by adding
> > > -c 100 to the perf command line:
> > > diff --git a/gcc/config/i386/gcc-auto-profile b/gcc/config/i386/gcc-auto-profile
> > > ...
> > > -exec perf record -e $E -b "$@"
> > > +exec perf record -e $E -c 100 -b "$@"
> > >
> > > Is 100 too small here?  Or is it fine for all scenarios?
> >
> > -c 100 is risky because it can cause perf throttling, which
> > makes it lose data.
> Right, it looks suspicious to me too.
>
> > perf has a limiter that if the PMU handler uses too much CPU
> > time it stops measuring for some time.  A PMI is 10k+ cycles,
> > so doing one every 100 branches is a lot of CPU time.
> >
> > I wouldn't go down that low.  It is better to increase the
> > iteration count.
> We can combine the two together, increasing iteration count and
> decreasing perf count at the same time.  What count would you suggest
> from your experience?

Can we instead for the tests where we want to test profile use/merge
elide the profiling step and supply the "raw" data in a testsuite
alternate file instead?

Richard.

> Thanks,
> bin
>
> > -Andi
Re: [PATCH AutoFDO]Restoring indirect call value profile transformation
On Wed, Dec 19, 2018 at 12:00 PM Andi Kleen wrote:
> On Wed, Dec 19, 2018 at 10:01:15AM +0800, Bin.Cheng wrote:
> > On Tue, Dec 18, 2018 at 7:15 PM Bin.Cheng wrote:
> > > On Sun, Dec 16, 2018 at 9:11 AM Andi Kleen wrote:
> > > > "bin.cheng" writes:
> > > > > Hi,
> > > > >
> > > > > Due to ICE and mal-functional bugs, indirect call value profile
> > > > > transformation is disabled on GCC-7/8/trunk.  This patch restores
> > > > > the transformation.  The main issue is AutoFDO should store
> > > > > cgraph_node's profile_id of the callee func in the first
> > > > > histogram value's counter, rather than a pointer to the callee's
> > > > > name string as it is now.
> > > > > With the patch, some "Indirect call -> direct call" tests pass
> > > > > with autofdo, while others are unstable.  I think the instability
> > > > > is caused by poor perf data collected during the regtest run, and
> > > > > can confirm these tests pass if good perf data could be collected
> > > > > in manual experiments.
> > > >
> > > > Would be good to make the tests stable, otherwise we'll just have
> > > > regressions in the future again.
> > > >
> > > > The problem is that the tests don't run long enough and don't get
> > > > enough samples?
> > > Yes, take g++.dg/tree-prof/morefunc.C as an example:
> > > -  int i;
> > > -  for (i = 0; i < 1000; i++)
> > > +  int i, j;
> > > +  for (i = 0; i < 100; i++)
> > > +    for (j = 0; j < 50; j++)
> > >      g += tc->foo();
> > >    if (g<100) g++;
> > >  }
> > > @@ -27,8 +28,9 @@ void test1 (A *tc)
> > >  static __attribute__((always_inline))
> > >  void test2 (B *tc)
> > >  {
> > > -  int i;
> > > +  int i, j;
> > >    for (i = 0; i < 100; i++)
> > > +    for (j = 0; j < 50; j++)
> > >
> > > I have to increase loop count like this to get stable pass on my
> > > machine.  The original count (1000) is too small to be sampled.
> > > >
> > > > Could add some loop?
> > > > Or possibly increase the sampling frequency in perf (-F or -c)?
> > > Maybe, I will have a try.
> > Turned out all "Indirect call" tests can be resolved by adding -c 100
> > to the perf command line:
> > diff --git a/gcc/config/i386/gcc-auto-profile b/gcc/config/i386/gcc-auto-profile
> > ...
> > -exec perf record -e $E -b "$@"
> > +exec perf record -e $E -c 100 -b "$@"
> >
> > Is 100 too small here?  Or is it fine for all scenarios?
>
> -c 100 is risky because it can cause perf throttling, which
> makes it lose data.
Right, it looks suspicious to me too.

> perf has a limiter that if the PMU handler uses too much CPU
> time it stops measuring for some time.  A PMI is 10k+ cycles,
> so doing one every 100 branches is a lot of CPU time.
>
> I wouldn't go down that low.  It is better to increase the
> iteration count.
We can combine the two together, increasing iteration count and
decreasing perf count at the same time.  What count would you suggest
from your experience?

Thanks,
bin

> -Andi
Re: [PATCH AutoFDO]Restoring indirect call value profile transformation
On Wed, Dec 19, 2018 at 10:01:15AM +0800, Bin.Cheng wrote:
> On Tue, Dec 18, 2018 at 7:15 PM Bin.Cheng wrote:
> > On Sun, Dec 16, 2018 at 9:11 AM Andi Kleen wrote:
> > > "bin.cheng" writes:
> > > > Hi,
> > > >
> > > > Due to ICE and mal-functional bugs, indirect call value profile
> > > > transformation is disabled on GCC-7/8/trunk.  This patch restores
> > > > the transformation.  The main issue is AutoFDO should store
> > > > cgraph_node's profile_id of the callee func in the first histogram
> > > > value's counter, rather than a pointer to the callee's name string
> > > > as it is now.
> > > > With the patch, some "Indirect call -> direct call" tests pass
> > > > with autofdo, while others are unstable.  I think the instability
> > > > is caused by poor perf data collected during the regtest run, and
> > > > can confirm these tests pass if good perf data could be collected
> > > > in manual experiments.
> > >
> > > Would be good to make the tests stable, otherwise we'll just have
> > > regressions in the future again.
> > >
> > > The problem is that the tests don't run long enough and don't get
> > > enough samples?
> > Yes, take g++.dg/tree-prof/morefunc.C as an example:
> > -  int i;
> > -  for (i = 0; i < 1000; i++)
> > +  int i, j;
> > +  for (i = 0; i < 100; i++)
> > +    for (j = 0; j < 50; j++)
> >      g += tc->foo();
> >    if (g<100) g++;
> >  }
> > @@ -27,8 +28,9 @@ void test1 (A *tc)
> >  static __attribute__((always_inline))
> >  void test2 (B *tc)
> >  {
> > -  int i;
> > +  int i, j;
> >    for (i = 0; i < 100; i++)
> > +    for (j = 0; j < 50; j++)
> >
> > I have to increase loop count like this to get stable pass on my
> > machine.  The original count (1000) is too small to be sampled.
> > >
> > > Could add some loop?
> > > Or possibly increase the sampling frequency in perf (-F or -c)?
> > Maybe, I will have a try.
> Turned out all "Indirect call" tests can be resolved by adding -c 100
> to the perf command line:
> diff --git a/gcc/config/i386/gcc-auto-profile b/gcc/config/i386/gcc-auto-profile
> ...
> -exec perf record -e $E -b "$@"
> +exec perf record -e $E -c 100 -b "$@"
>
> Is 100 too small here?  Or is it fine for all scenarios?

-c 100 is risky because it can cause perf throttling, which makes it
lose data.

perf has a limiter that if the PMU handler uses too much CPU time it
stops measuring for some time.  A PMI is 10k+ cycles, so doing one
every 100 branches is a lot of CPU time.

I wouldn't go down that low.  It is better to increase the iteration
count.

-Andi
Re: [PATCH AutoFDO]Restoring indirect call value profile transformation
On Wed, Dec 19, 2018 at 09:26:51AM +0800, Bin.Cheng wrote:
> On Wed, Dec 19, 2018 at 5:27 AM Andi Kleen wrote:
> > > Yes, take g++.dg/tree-prof/morefunc.C as an example:
> > > -  int i;
> > > -  for (i = 0; i < 1000; i++)
> > > +  int i, j;
> > > +  for (i = 0; i < 100; i++)
> > > +    for (j = 0; j < 50; j++)
> > >      g += tc->foo();
> > >    if (g<100) g++;
> > >  }
> > > @@ -27,8 +28,9 @@ void test1 (A *tc)
> > >  static __attribute__((always_inline))
> > >  void test2 (B *tc)
> > >  {
> > > -  int i;
> > > +  int i, j;
> > >    for (i = 0; i < 100; i++)
> > > +    for (j = 0; j < 50; j++)
> > >
> > > I have to increase loop count like this to get stable pass on my
> > > machine.  The original count (1000) is too small to be sampled.
> >
> > IIRC It was originally higher, but people running on slow simulators
> > complained, so it was reduced.  Perhaps we need some way to detect in
> > the test suite that the test runs on a real CPU.
> Is there a concise way to do this, given gcc may be run in all kinds of
> virtual scenarios?

Virtual should be fine too, just simulators are too slow.

I hope there is, because we certainly need a solution for production
ready autofdo.  Or perhaps could just check if perf is working and only
run the tests if that is true.  The TCL code already checks that.  Just
would need to pass that information somehow as a define.

Overall I suspect far more test coverage is needed to make it solid.
The existing tests are not that great.

> > > > > FYI, an update about AutoFDO status:
> > > > > All AutoFDO ICEs in regtest are fixed, while several tests still
> > > > > failing fall in below three categories:
> > > >
> > > > Great!
> > > >
> > > > Of course it still ICEs with LTO?
> > > >
> > > > Right now there is no test case for this I think.  Probably one
> > > > should be added.
> > > Any comments on this?
> We'd like to further investigate AutoFDO+LTO, may I ask what the
> status is (or was)?  Any background elaboration about this would be
> appreciated.

It just never worked and ICEs very quickly if you try it.  There's an
open PR (PR71672).

There are some other open issues with autofdo BTW, e.g. the old 4.9
google branch still has more features than mainline.  For example it
supported discriminators, so it can distinguish more than one basic
block per source line.  The last time I tested, the gains with mainline
autofdo were also significantly less than 4.9-google, so there might be
other tunings missing.

-Andi
Re: [PATCH AutoFDO]Restoring indirect call value profile transformation
On Tue, Dec 18, 2018 at 7:15 PM Bin.Cheng wrote:
> On Sun, Dec 16, 2018 at 9:11 AM Andi Kleen wrote:
> > "bin.cheng" writes:
> > > Hi,
> > >
> > > Due to ICE and malfunction bugs, indirect call value profile
> > > transformation is disabled on GCC-7/8/trunk. This patch restores the
> > > transformation. The main issue is AutoFDO should store cgraph_node's
> > > profile_id of the callee function in the first histogram value's
> > > counter, rather than a pointer to the callee's name string as it does
> > > now.
> > > With the patch, some "Indirect call -> direct call" tests pass with
> > > autofdo, while others are unstable. I think the instability is caused
> > > by poor perf data collected during the regtest run, and I can confirm
> > > these tests pass if good perf data is collected in manual experiments.
> >
> > Would be good to make the tests stable, otherwise we'll just have
> > regressions in the future again.
> >
> > The problem is that the tests don't run long enough and don't get
> > enough samples?
> Yes, take g++.dg/tree-prof/morefunc.C as an example:
>
> -  int i;
> -  for (i = 0; i < 1000; i++)
> +  int i, j;
> +  for (i = 0; i < 100; i++)
> +    for (j = 0; j < 50; j++)
>        g += tc->foo();
>    if (g<100) g++;
>  }
> @@ -27,8 +28,9 @@ void test1 (A *tc)
>  static __attribute__((always_inline))
>  void test2 (B *tc)
>  {
> -  int i;
> +  int i, j;
>    for (i = 0; i < 100; i++)
> +    for (j = 0; j < 50; j++)
>
> I have to increase the loop count like this to get a stable pass on my
> machine. The original count (1000) is too small to be sampled.
>
> > Could add some loop?
> > Or possibly increase the sampling frequency in perf (-F or -c)?
> Maybe, I will have a try.

It turned out all "Indirect call" tests can be resolved by adding -c 100
to the perf command line:

diff --git a/gcc/config/i386/gcc-auto-profile b/gcc/config/i386/gcc-auto-profile
...
-exec perf record -e $E -b "$@"
+exec perf record -e $E -c 100 -b "$@"

Is 100 too small here?  Or is it fine for all scenarios?

Thanks,
bin

> > Or run them multiple times and use gcov_merge to merge the files?
> Without changing loop count or sampling frequency, this is not likely
> to be helpful, since perf doesn't hit the small loop in most cases.
>
> Thanks,
> bin
>
> > > FYI, an update about AutoFDO status:
> > > All AutoFDO ICEs in regtest are fixed, while several tests still
> > > failing fall into the three categories below:
> >
> > Great!
> >
> > Of course it still ICEs with LTO?
> >
> > Right now there is no test case for this I think. Probably one should
> > be added.
> >
> > -Andi
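The effect of the -c 100 change can be illustrated with a toy model (not GCC or perf code; every number below is a made-up assumption, not a measurement from this thread): with periodic sampling, a code region that accounts for a short stretch of the event stream receives roughly loop_events / period samples, so a small test loop can easily get zero samples unless the sample period is forced down.

```python
# Toy model of periodic sampling, illustrating why a short test loop
# may receive no perf samples at all. All numbers are illustrative
# assumptions, not measurements from the thread.

def expected_samples(loop_events: int, period: int) -> int:
    """Approximate number of samples landing in a code region that
    accounts for `loop_events` events, sampling every `period` events."""
    return loop_events // period

# Suppose the morefunc.C-style loop accounts for ~50,000 events.
loop_events = 50_000

# With a large effective period, the loop is invisible to the profile.
print(expected_samples(loop_events, 400_000))  # -> 0

# Forcing a small fixed period (as perf record -c 100 does) samples it
# densely enough for the value profile to be stable.
print(expected_samples(loop_events, 100))      # -> 500
```

This is why raising the trip count and lowering the sample period attack the same problem from both sides: both push loop_events / period well above zero.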
Re: [PATCH AutoFDO]Restoring indirect call value profile transformation
On Wed, Dec 19, 2018 at 5:27 AM Andi Kleen wrote:
> > Yes, take g++.dg/tree-prof/morefunc.C as an example:
> >
> > -  int i;
> > -  for (i = 0; i < 1000; i++)
> > +  int i, j;
> > +  for (i = 0; i < 100; i++)
> > +    for (j = 0; j < 50; j++)
> >        g += tc->foo();
> >    if (g<100) g++;
> >  }
> > @@ -27,8 +28,9 @@ void test1 (A *tc)
> >  static __attribute__((always_inline))
> >  void test2 (B *tc)
> >  {
> > -  int i;
> > +  int i, j;
> >    for (i = 0; i < 100; i++)
> > +    for (j = 0; j < 50; j++)
> >
> > I have to increase the loop count like this to get a stable pass on my
> > machine. The original count (1000) is too small to be sampled.
>
> IIRC it was originally higher, but people running on slow simulators
> complained, so it was reduced. Perhaps we need some way to detect in
> the test suite that the test runs on a real CPU.

Is there a concise way to do this, given GCC may be run in all kinds of
virtualized environments?

> > > > FYI, an update about AutoFDO status:
> > > > All AutoFDO ICEs in regtest are fixed, while several tests still
> > > > failing fall into the three categories below:
> > >
> > > Great!
> > >
> > > Of course it still ICEs with LTO?
> > >
> > > Right now there is no test case for this I think. Probably one should
> > > be added.

Any comments on this? We'd like to further investigate AutoFDO+LTO; may I
ask what the status is (or was)? Any background on this would be
appreciated.

Thanks,
bin

> -Andi
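On the question of detecting whether the test runs on a real CPU: one partial heuristic (a sketch, not an existing DejaGnu or test-suite mechanism) is that on x86 Linux a guest kernel usually advertises the `hypervisor` CPUID flag in /proc/cpuinfo. It would not catch slow simulators that hide the flag, so it is at best a coarse filter:

```python
# Sketch only: spot an obviously virtualized x86 Linux host via the
# "hypervisor" CPUID flag in /proc/cpuinfo. Simulators that hide the
# flag, and non-Linux or non-x86 hosts, need different checks.

def looks_virtualized(cpuinfo_text: str) -> bool:
    """Return True if the first 'flags' line lists 'hypervisor'."""
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags"):
            return "hypervisor" in line.split()
    return False

def host_looks_virtualized(path: str = "/proc/cpuinfo") -> bool:
    try:
        with open(path) as f:
            return looks_virtualized(f.read())
    except OSError:
        return False  # no cpuinfo available: treat as unknown/bare metal

print(host_looks_virtualized())
```

A harness could use this to skip (or XFAIL) sampling-sensitive AutoFDO tests on hosts where perf data is unlikely to be representative.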
Re: [PATCH AutoFDO]Restoring indirect call value profile transformation
> Yes, take g++.dg/tree-prof/morefunc.C as an example:
>
> -  int i;
> -  for (i = 0; i < 1000; i++)
> +  int i, j;
> +  for (i = 0; i < 100; i++)
> +    for (j = 0; j < 50; j++)
>        g += tc->foo();
>    if (g<100) g++;
>  }
> @@ -27,8 +28,9 @@ void test1 (A *tc)
>  static __attribute__((always_inline))
>  void test2 (B *tc)
>  {
> -  int i;
> +  int i, j;
>    for (i = 0; i < 100; i++)
> +    for (j = 0; j < 50; j++)
>
> I have to increase the loop count like this to get a stable pass on my
> machine. The original count (1000) is too small to be sampled.

IIRC it was originally higher, but people running on slow simulators
complained, so it was reduced. Perhaps we need some way to detect in the
test suite that the test runs on a real CPU.

> > > FYI, an update about AutoFDO status:
> > > All AutoFDO ICEs in regtest are fixed, while several tests still
> > > failing fall into the three categories below:
> >
> > Great!
> >
> > Of course it still ICEs with LTO?
> >
> > Right now there is no test case for this I think. Probably one should
> > be added.

Any comments on this?

-Andi
Re: [PATCH AutoFDO]Restoring indirect call value profile transformation
On Sun, Dec 16, 2018 at 9:11 AM Andi Kleen wrote:
> "bin.cheng" writes:
> > Hi,
> >
> > Due to ICE and malfunction bugs, indirect call value profile
> > transformation is disabled on GCC-7/8/trunk. This patch restores the
> > transformation. The main issue is AutoFDO should store cgraph_node's
> > profile_id of the callee function in the first histogram value's
> > counter, rather than a pointer to the callee's name string as it does
> > now.
> > With the patch, some "Indirect call -> direct call" tests pass with
> > autofdo, while others are unstable. I think the instability is caused
> > by poor perf data collected during the regtest run, and I can confirm
> > these tests pass if good perf data is collected in manual experiments.
>
> Would be good to make the tests stable, otherwise we'll just have
> regressions in the future again.
>
> The problem is that the tests don't run long enough and don't get enough
> samples?

Yes, take g++.dg/tree-prof/morefunc.C as an example:

-  int i;
-  for (i = 0; i < 1000; i++)
+  int i, j;
+  for (i = 0; i < 100; i++)
+    for (j = 0; j < 50; j++)
       g += tc->foo();
   if (g<100) g++;
 }
@@ -27,8 +28,9 @@ void test1 (A *tc)
 static __attribute__((always_inline))
 void test2 (B *tc)
 {
-  int i;
+  int i, j;
   for (i = 0; i < 100; i++)
+    for (j = 0; j < 50; j++)

I have to increase the loop count like this to get a stable pass on my
machine. The original count (1000) is too small to be sampled.

> Could add some loop?
> Or possibly increase the sampling frequency in perf (-F or -c)?

Maybe, I will have a try.

> Or run them multiple times and use gcov_merge to merge the files?

Without changing loop count or sampling frequency, this is not likely to
be helpful, since perf doesn't hit the small loop in most cases.

Thanks,
bin

> > FYI, an update about AutoFDO status:
> > All AutoFDO ICEs in regtest are fixed, while several tests still
> > failing fall into the three categories below:
>
> Great!
>
> Of course it still ICEs with LTO?
>
> Right now there is no test case for this I think. Probably one should be
> added.
>
> -Andi
Re: [PATCH AutoFDO]Restoring indirect call value profile transformation
"bin.cheng" writes: > Hi, > > Due to ICE and mal-functional bugs, indirect call value profile transformation > is disabled on GCC-7/8/trunk. This patch restores the transformation. The > main issue is AutoFDO should store cgraph_node's profile_id of callee func in > the first histogram value's counter, rather than pointer to callee's name > string > as it is now. > With the patch, some "Indirect call -> direct call" tests pass with autofdo, > while > others are unstable. I think the instability is caused by poor perf data > collected > during regrets run, and can confirm these tests pass if good perf data could > be > collected in manual experiments. Would be good to make the tests stable, otherwise we'll just have regressions in the future again. The problem is that the tests don't run long enough and don't get enough samples? Could add some loop? Or possibly increase the sampling frequency in perf (-F or -c)? Or run them multiple times and use gcov_merge to merge the files? > FYI, an update about AutoFDO status: > All AutoFDO ICEs in regtest are fixed, while several tests still failing fall > in below > three categories: Great! Of course it still ICEs with LTO? Right now there is no test case for this I think. Probably one should be added. -Andi
Re: [PATCH AutoFDO]Restoring indirect call value profile transformation
On 12/12/18 8:50 PM, bin.cheng wrote:
> Hi,
>
> Due to ICE and malfunction bugs, indirect call value profile
> transformation is disabled on GCC-7/8/trunk. This patch restores the
> transformation. The main issue is AutoFDO should store cgraph_node's
> profile_id of the callee function in the first histogram value's
> counter, rather than a pointer to the callee's name string as it does
> now.
> With the patch, some "Indirect call -> direct call" tests pass with
> autofdo, while others are unstable. I think the instability is caused by
> poor perf data collected during the regtest run, and I can confirm these
> tests pass if good perf data is collected in manual experiments.
>
> Bootstrapped and tested along with previous patches. Is it OK?
>
> FYI, an update about AutoFDO status:
> All AutoFDO ICEs in regtest are fixed, while several tests still failing
> fall into the three categories below:
>
> Unstable indirect call value profile transformation:
> FAIL: g++.dg/tree-prof/indir-call-prof.C scan-ipa-dump afdo "Indirect call -> direct call.* AA transformation on insn"
> FAIL: g++.dg/tree-prof/morefunc.C scan-ipa-dump-times afdo "Indirect call -> direct call" 2
> FAIL: g++.dg/tree-prof/pr35545.C scan-ipa-dump profile_estimate "Indirect call -> direct call"
>
> Loop peeling case, because we don't honor autofdo profile count as
> reliable:
> FAIL: gcc.dg/tree-prof/peel-1.c scan-tree-dump cunroll "Peeled loop ., 1 times"
>
> Cold/hot partition cases:
> FAIL: gcc.dg/tree-prof/cold_partition_label.c scan-assembler foo[._]+cold
> FAIL: gcc.dg/tree-prof/cold_partition_label.c scan-assembler size[ \ta-zA-Z0-0]+foo[._]+cold
> FAIL: gcc.dg/tree-prof/section-attr-1.c scan-assembler .section[\t ]*.text.unlikely[\\n\\r]+[\t ]*.size[\t ]*foo.cold
> FAIL: gcc.dg/tree-prof/section-attr-2.c scan-assembler .section[\t ]*.text.unlikely[\\n\\r]+[\t ]*.size[\t ]*foo.cold
> FAIL: gcc.dg/tree-prof/section-attr-3.c scan-assembler .section[\t ]*.text.unlikely[\\n\\r]+[\t ]*.size[\t ]*foo.cold
>
> These are more difficult to enable because we can't simply treat
> autofdo::zero count as cold; there is just too much such code.
>
> Besides regtest, I ran autofdo on the kernel and mysql-server; the build
> and performance match expectations now, but I haven't run autofdo on any
> SPEC benchmark yet.
>
> Thanks,
> bin
>
> 2018-12-13  Bin Cheng
>
> 	* auto-profile.c (afdo_indirect_call): Skip generating histogram
> 	value if we can't find cgraph_node for the indirect callee. Save
> 	profile_id of the cgraph_node in histogram value's first counter.
> 	* value-prof.c (gimple_value_profile_transformations): Don't skip
> 	for flag_auto_profile.

OK
jeff
[PATCH AutoFDO]Restoring indirect call value profile transformation
Hi,

Due to ICE and malfunction bugs, indirect call value profile
transformation is disabled on GCC-7/8/trunk. This patch restores the
transformation. The main issue is AutoFDO should store cgraph_node's
profile_id of the callee function in the first histogram value's counter,
rather than a pointer to the callee's name string as it does now.
With the patch, some "Indirect call -> direct call" tests pass with
autofdo, while others are unstable. I think the instability is caused by
poor perf data collected during the regtest run, and I can confirm these
tests pass if good perf data is collected in manual experiments.

Bootstrapped and tested along with previous patches. Is it OK?

FYI, an update about AutoFDO status:
All AutoFDO ICEs in regtest are fixed, while several tests still failing
fall into the three categories below:

Unstable indirect call value profile transformation:
FAIL: g++.dg/tree-prof/indir-call-prof.C scan-ipa-dump afdo "Indirect call -> direct call.* AA transformation on insn"
FAIL: g++.dg/tree-prof/morefunc.C scan-ipa-dump-times afdo "Indirect call -> direct call" 2
FAIL: g++.dg/tree-prof/pr35545.C scan-ipa-dump profile_estimate "Indirect call -> direct call"

Loop peeling case, because we don't honor autofdo profile count as
reliable:
FAIL: gcc.dg/tree-prof/peel-1.c scan-tree-dump cunroll "Peeled loop ., 1 times"

Cold/hot partition cases:
FAIL: gcc.dg/tree-prof/cold_partition_label.c scan-assembler foo[._]+cold
FAIL: gcc.dg/tree-prof/cold_partition_label.c scan-assembler size[ \ta-zA-Z0-0]+foo[._]+cold
FAIL: gcc.dg/tree-prof/section-attr-1.c scan-assembler .section[\t ]*.text.unlikely[\\n\\r]+[\t ]*.size[\t ]*foo.cold
FAIL: gcc.dg/tree-prof/section-attr-2.c scan-assembler .section[\t ]*.text.unlikely[\\n\\r]+[\t ]*.size[\t ]*foo.cold
FAIL: gcc.dg/tree-prof/section-attr-3.c scan-assembler .section[\t ]*.text.unlikely[\\n\\r]+[\t ]*.size[\t ]*foo.cold

These are more difficult to enable because we can't simply treat
autofdo::zero count as cold; there is just too much such code.

Besides regtest, I ran autofdo on the kernel and mysql-server; the build
and performance match expectations now, but I haven't run autofdo on any
SPEC benchmark yet.

Thanks,
bin

2018-12-13  Bin Cheng

	* auto-profile.c (afdo_indirect_call): Skip generating histogram
	value if we can't find cgraph_node for the indirect callee. Save
	profile_id of the cgraph_node in histogram value's first counter.
	* value-prof.c (gimple_value_profile_transformations): Don't skip
	for flag_auto_profile.

0007-Fix-autofdo-indirect-call-value-prof-transformation.patch
Description: Binary data
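The core of the fix described above can be modeled outside GCC like this (all names below are hypothetical except profile_id and cgraph_node, which the patch description mentions): once the histogram's first counter holds the callee's numeric profile_id instead of a pointer to its name string, the transformation pass can resolve the hot callee with the same id-based lookup regular FDO uses.

```python
# Hypothetical model of the fix, not GCC source code. A histogram entry
# for an indirect call records which callee was hot; the question is
# what identifier goes into its first counter.

class CgraphNode:
    """Stand-in for GCC's cgraph_node: a function with a profile_id."""
    def __init__(self, name, profile_id):
        self.name = name
        self.profile_id = profile_id

nodes = [CgraphNode("foo", 0x1234), CgraphNode("bar", 0x5678)]
by_profile_id = {n.profile_id: n for n in nodes}

def record_indirect_target(node, count):
    # The fix: store the numeric profile_id, not a name-string pointer,
    # so the value-profiling pass can match it against cgraph_nodes.
    return {"counters": [node.profile_id], "count": count}

hist = record_indirect_target(nodes[0], 100)

# The "Indirect call -> direct call" pass can now resolve the target.
target = by_profile_id.get(hist["counters"][0])
print(target.name)  # -> foo
```

With a name-string pointer in the counter, the id lookup on the last line would find nothing, which is the mal-function the patch removes.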