Re: [PATCH AutoFDO]Restoring indirect call value profile transformation

2019-01-31 Thread Bin.Cheng
On Tue, Jan 15, 2019 at 1:10 AM Andi Kleen  wrote:
>
> On Mon, Jan 14, 2019 at 04:15:20PM +0800, Bin.Cheng wrote:
> > On Mon, Jan 14, 2019 at 4:07 PM Andi Kleen  wrote:
> > >
> > > Bin Cheng,
> > >
> > > I did some testing on this now. The attached patch automatically 
> > > increases the iterations
> > > for autofdo profiles.
> > Hi Andi, thanks very much for tuning these.
> > >
> > > But even with even more iterations I still have stable failures in
> > >
> > > FAIL: gcc.dg/tree-prof/cold_partition_label.c scan-assembler foo[._]+cold
> > > FAIL: gcc.dg/tree-prof/cold_partition_label.c scan-assembler size[ 
> > > \ta-zA-Z0-0]+foo[._]+cold
> > I think these two are supposed to fail with current code base.
>
>
Sorry for being late.
> We should mark it as XFAIL then I guess.
Yeah
>
> Is it understood why it doesn't work?
I didn't dig into it, but I think the reason is autofdo::zero count is
not considered as cold as profile count.  It's kind of reasonable,
otherwise too much code would be categorized as cold.
>
> > > FAIL: gcc.dg/tree-prof/indir-call-prof.c scan-ipa-dump afdo "Indirect 
> > > call -> direct call.* a1 transformation on insn"
> > I also got unstable pass/fail for indirect call optimization when
> > tuning iterations, and haven't got an iteration number which passes
> > all the time.  I guess we need to combine decreasing of sampling count
> > here.
>
> Okay I will look into that.
>
> Could also try if prime sample after values help, this sometimes fixes
> problems with systematically missing some code in sampling.
>
> > > FAIL: gcc.dg/tree-prof/peel-1.c scan-tree-dump cunroll "Peeled loop ., 1 
> > > times"
> > This one should fail too.
>
> Same.
Maybe I shouldn't use words "should fail", but the reason is we don't
consider autofdo profile count as reliable in niters analysis.  Maybe
this can be improved somehow.

Thanks,
bin
>
> -Andi


Re: [PATCH AutoFDO]Restoring indirect call value profile transformation

2019-01-14 Thread Andi Kleen
On Mon, Jan 14, 2019 at 04:15:20PM +0800, Bin.Cheng wrote:
> On Mon, Jan 14, 2019 at 4:07 PM Andi Kleen  wrote:
> >
> > Bin Cheng,
> >
> > I did some testing on this now. The attached patch automatically increases 
> > the iterations
> > for autofdo profiles.
> Hi Andi, thanks very much for tuning these.
> >
> > But even with even more iterations I still have stable failures in
> >
> > FAIL: gcc.dg/tree-prof/cold_partition_label.c scan-assembler foo[._]+cold
> > FAIL: gcc.dg/tree-prof/cold_partition_label.c scan-assembler size[ 
> > \ta-zA-Z0-0]+foo[._]+cold
> I think these two are supposed to fail with current code base.


We should mark it as XFAIL then I guess.

Is it understood why it doesn't work?

> > FAIL: gcc.dg/tree-prof/indir-call-prof.c scan-ipa-dump afdo "Indirect call 
> > -> direct call.* a1 transformation on insn"
> I also got unstable pass/fail for indirect call optimization when
> tuning iterations, and haven't got an iteration number which passes
> all the time.  I guess we need to combine decreasing of sampling count
> here.

Okay I will look into that.

Could also try if prime sample after values help, this sometimes fixes
problems with systematically missing some code in sampling.

> > FAIL: gcc.dg/tree-prof/peel-1.c scan-tree-dump cunroll "Peeled loop ., 1 
> > times"
> This one should fail too.

Same.

-Andi


Re: [PATCH AutoFDO]Restoring indirect call value profile transformation

2019-01-14 Thread Bin.Cheng
On Mon, Jan 14, 2019 at 4:07 PM Andi Kleen  wrote:
>
> Bin Cheng,
>
> I did some testing on this now. The attached patch automatically increases 
> the iterations
> for autofdo profiles.
Hi Andi, thanks very much for tuning these.
>
> But even with even more iterations I still have stable failures in
>
> FAIL: gcc.dg/tree-prof/cold_partition_label.c scan-assembler foo[._]+cold
> FAIL: gcc.dg/tree-prof/cold_partition_label.c scan-assembler size[ 
> \ta-zA-Z0-0]+foo[._]+cold
I think these two are supposed to fail with current code base.
> FAIL: gcc.dg/tree-prof/indir-call-prof.c scan-ipa-dump afdo "Indirect call -> 
> direct call.* a1 transformation on insn"
I also got unstable pass/fail for indirect call optimization when
tuning iterations, and haven't got an iteration number which passes
all the time.  I guess we need to combine decreasing of sampling count
here.
> FAIL: gcc.dg/tree-prof/peel-1.c scan-tree-dump cunroll "Peeled loop ., 1 
> times"
This one should fail too.

Thanks,
bin
>
> Did these really ever work for you?
>
> -Andi
>
>
> diff --git a/gcc/testsuite/g++.dg/tree-prof/morefunc.C 
> b/gcc/testsuite/g++.dg/tree-prof/morefunc.C
> index a9bdc167f45..02b01c073e9 100644
> --- a/gcc/testsuite/g++.dg/tree-prof/morefunc.C
> +++ b/gcc/testsuite/g++.dg/tree-prof/morefunc.C
> @@ -2,6 +2,10 @@
>  #include "reorder_class1.h"
>  #include "reorder_class2.h"
>
> +#ifndef ITER
> +#define ITER 1000
> +#endif
> +
>  int g;
>
>  #ifdef _PROFILE_USE
> @@ -19,7 +23,7 @@ static __attribute__((always_inline))
>  void test1 (A *tc)
>  {
>int i;
> -  for (i = 0; i < 1000; i++)
> +  for (i = 0; i < ITER; i++)
>   g += tc->foo();
> if (g<100) g++;
>  }
> @@ -28,7 +32,7 @@ static __attribute__((always_inline))
>  void test2 (B *tc)
>  {
>int i;
> -  for (i = 0; i < 100; i++)
> +  for (i = 0; i < ITER; i++)
>   g += tc->foo();
>  }
>
> diff --git a/gcc/testsuite/gcc.dg/tree-prof/cold_partition_label.c 
> b/gcc/testsuite/gcc.dg/tree-prof/cold_partition_label.c
> index 450308d6407..099069da6a7 100644
> --- a/gcc/testsuite/gcc.dg/tree-prof/cold_partition_label.c
> +++ b/gcc/testsuite/gcc.dg/tree-prof/cold_partition_label.c
> @@ -9,6 +9,10 @@ const char *sarr[SIZE];
>  const char *buf_hot;
>  const char *buf_cold;
>
> +#ifndef ITER
> +#define ITER 100
> +#endif
> +
>  __attribute__((noinline))
>  void
>  foo (int path)
> @@ -32,7 +36,7 @@ main (int argc, char *argv[])
>int i;
>buf_hot =  "hello";
>buf_cold = "world";
> -  for (i = 0; i < 100; i++)
> +  for (i = 0; i < ITER; i++)
>  foo (argc);
>return 0;
>  }
> diff --git a/gcc/testsuite/gcc.dg/tree-prof/crossmodule-indircall-1.c 
> b/gcc/testsuite/gcc.dg/tree-prof/crossmodule-indircall-1.c
> index 58109d54dc7..32d22c69c6c 100644
> --- a/gcc/testsuite/gcc.dg/tree-prof/crossmodule-indircall-1.c
> +++ b/gcc/testsuite/gcc.dg/tree-prof/crossmodule-indircall-1.c
> @@ -2,6 +2,10 @@
>  /* { dg-additional-sources "crossmodule-indircall-1a.c" } */
>  /* { dg-options "-O3 -flto -DDOJOB=1" } */
>
> +#ifndef ITER
> +#define ITER 1000
> +#endif
> +
>  int a;
>  extern void (*p[2])(int n);
>  void abort (void);
> @@ -10,12 +14,12 @@ main()
>  { int i;
>
>/* This call shall be converted.  */
> -  for (i = 0;i<1000;i++)
> +  for (i = 0;i  p[0](1);
>/* This call shall not be converted.  */
> -  for (i = 0;i<1000;i++)
> +  for (i = 0;i  p[i%2](2);
> -  if (a != 1000)
> +  if (a != ITER)
>  abort ();
>
>return 0;
> diff --git a/gcc/testsuite/gcc.dg/tree-prof/indir-call-prof.c 
> b/gcc/testsuite/gcc.dg/tree-prof/indir-call-prof.c
> index 53063c3e7fa..8b9dfbb78c7 100644
> --- a/gcc/testsuite/gcc.dg/tree-prof/indir-call-prof.c
> +++ b/gcc/testsuite/gcc.dg/tree-prof/indir-call-prof.c
> @@ -1,5 +1,9 @@
>  /* { dg-options "-O2 -fdump-tree-optimized -fdump-ipa-profile 
> -fdump-ipa-afdo" } */
>
> +#ifndef ITER
> +#define ITER 10
> +#endif
> +
>  static int a1 (void)
>  {
>  return 10;
> @@ -28,7 +32,7 @@ main (void)
>int (*p) (void);
>int  i;
>
> -  for (i = 0; i < 1000; i ++)
> +  for (i = 0; i < ITER*100; i++)
>  {
> setp (, i);
> p ();
> diff --git a/gcc/testsuite/gcc.dg/tree-prof/peel-1.c 
> b/gcc/testsuite/gcc.dg/tree-prof/peel-1.c
> index 7245b68c1ee..b6ed178e1ad 100644
> --- a/gcc/testsuite/gcc.dg/tree-prof/peel-1.c
> +++ b/gcc/testsuite/gcc.dg/tree-prof/peel-1.c
> @@ -1,13 +1,17 @@
>  /* { dg-options "-O3 -fdump-tree-cunroll-details -fno-unroll-loops 
> -fpeel-loops" } */
>  void abort();
>
> -int a[1000];
> +#ifndef ITER
> +#define ITER 1000
> +#endif
> +
> +int a[ITER];
>  int
>  __attribute__ ((noinline))
>  t()
>  {
>int i;
> -  for (i=0;i<1000;i++)
> +  for (i=0;i  if (!a[i])
>return 1;
>abort ();
> @@ -16,7 +20,7 @@ int
>  main()
>  {
>int i;
> -  for (i=0;i<1000;i++)
> +  for (i=0;i  t();
>return 0;
>  }
> diff --git a/gcc/testsuite/gcc.dg/tree-prof/pr52027.c 
> b/gcc/testsuite/gcc.dg/tree-prof/pr52027.c
> index c46a14b2e86..bf2a83a336d 

Re: [PATCH AutoFDO]Restoring indirect call value profile transformation

2019-01-14 Thread Andi Kleen
Bin Cheng,

I did some testing on this now. The attached patch automatically increases the 
iterations
for autofdo profiles.

But even with even more iterations I still have stable failures in 

FAIL: gcc.dg/tree-prof/cold_partition_label.c scan-assembler foo[._]+cold
FAIL: gcc.dg/tree-prof/cold_partition_label.c scan-assembler size[ 
\ta-zA-Z0-0]+foo[._]+cold
FAIL: gcc.dg/tree-prof/indir-call-prof.c scan-ipa-dump afdo "Indirect call -> 
direct call.* a1 transformation on insn"
FAIL: gcc.dg/tree-prof/peel-1.c scan-tree-dump cunroll "Peeled loop ., 1 times"

Did these really ever work for you? 

-Andi


diff --git a/gcc/testsuite/g++.dg/tree-prof/morefunc.C 
b/gcc/testsuite/g++.dg/tree-prof/morefunc.C
index a9bdc167f45..02b01c073e9 100644
--- a/gcc/testsuite/g++.dg/tree-prof/morefunc.C
+++ b/gcc/testsuite/g++.dg/tree-prof/morefunc.C
@@ -2,6 +2,10 @@
 #include "reorder_class1.h"
 #include "reorder_class2.h"
 
+#ifndef ITER
+#define ITER 1000
+#endif
+
 int g;
 
 #ifdef _PROFILE_USE
@@ -19,7 +23,7 @@ static __attribute__((always_inline))
 void test1 (A *tc)
 {
   int i;
-  for (i = 0; i < 1000; i++)
+  for (i = 0; i < ITER; i++)
  g += tc->foo(); 
if (g<100) g++;
 }
@@ -28,7 +32,7 @@ static __attribute__((always_inline))
 void test2 (B *tc)
 {
   int i;
-  for (i = 0; i < 100; i++)
+  for (i = 0; i < ITER; i++)
  g += tc->foo();
 }
 
diff --git a/gcc/testsuite/gcc.dg/tree-prof/cold_partition_label.c 
b/gcc/testsuite/gcc.dg/tree-prof/cold_partition_label.c
index 450308d6407..099069da6a7 100644
--- a/gcc/testsuite/gcc.dg/tree-prof/cold_partition_label.c
+++ b/gcc/testsuite/gcc.dg/tree-prof/cold_partition_label.c
@@ -9,6 +9,10 @@ const char *sarr[SIZE];
 const char *buf_hot;
 const char *buf_cold;
 
+#ifndef ITER
+#define ITER 100
+#endif
+
 __attribute__((noinline))
 void 
 foo (int path)
@@ -32,7 +36,7 @@ main (int argc, char *argv[])
   int i;
   buf_hot =  "hello";
   buf_cold = "world";
-  for (i = 0; i < 100; i++)
+  for (i = 0; i < ITER; i++)
 foo (argc);
   return 0;
 }
diff --git a/gcc/testsuite/gcc.dg/tree-prof/crossmodule-indircall-1.c 
b/gcc/testsuite/gcc.dg/tree-prof/crossmodule-indircall-1.c
index 58109d54dc7..32d22c69c6c 100644
--- a/gcc/testsuite/gcc.dg/tree-prof/crossmodule-indircall-1.c
+++ b/gcc/testsuite/gcc.dg/tree-prof/crossmodule-indircall-1.c
@@ -2,6 +2,10 @@
 /* { dg-additional-sources "crossmodule-indircall-1a.c" } */
 /* { dg-options "-O3 -flto -DDOJOB=1" } */
 
+#ifndef ITER
+#define ITER 1000
+#endif
+
 int a;
 extern void (*p[2])(int n);
 void abort (void);
@@ -10,12 +14,12 @@ main()
 { int i;
 
   /* This call shall be converted.  */
-  for (i = 0;i<1000;i++)
+  for (i = 0;i

Re: [PATCH AutoFDO]Restoring indirect call value profile transformation

2018-12-21 Thread Hans-Peter Nilsson
On Tue, 18 Dec 2018, Andi Kleen wrote:

> > Yes, take g++.dg/tree-prof/morefunc.C as an example:
> > -  int i;
> > -  for (i = 0; i < 1000; i++)
> > +  int i, j;
> > +  for (i = 0; i < 100; i++)
> > +for (j = 0; j < 50; j++)
> >   g += tc->foo();
> > if (g<100) g++;
> >  }
> > @@ -27,8 +28,9 @@ void test1 (A *tc)
> >  static __attribute__((always_inline))
> >  void test2 (B *tc)
> >  {
> > -  int i;
> > +  int i, j;
> >for (i = 0; i < 100; i++)
> > +for (j = 0; j < 50; j++)
> >
> > I have to increase loop count like this to get stable pass on my
> > machine.  The original count (1000) is too small to be sampled.
>
> IIRC It was originally higher, but people running on slow simulators 
> complained,
> so it was reduced.  Perhaps we need some way to detect in the test suite
> that the test runs on a real CPU.

Doesn't check_effective_target_simulator work here?
See e.g. libstdc++-v3/testsuite/25_algorithms/heap/moveable2.cc
for an example.

brgds, H-P


Re: [PATCH AutoFDO]Restoring indirect call value profile transformation

2018-12-19 Thread Andi Kleen
On Wed, Dec 19, 2018 at 06:28:29PM +0100, Richard Biener wrote:
> On Wed, Dec 19, 2018 at 4:41 PM Andi Kleen  wrote:
> >
> > > > We can combine the two together, increasing iteration count and
> > > > decreasing perf count at the same time.  What count would you suggest
> > > > from your experience?
> > >
> > > Can we instead for the tests where we want to test profile use/merge
> > > elide the profiling step and supply the "raw" data in an testsuite 
> > > alternate
> > > file instead?
> >
> > That would be possible, but a drawback is that we wouldn't have an
> > "end2end" test anymore that also tests the interaction with perf
> > and autofdo. Would be good to test these cases too, there were regressions
> > in this before.
> 
> Sure.
> 
> > But perhaps splitting that into two separate tests is reasonable,
> > with the majority of tests running with fake data.
> >
> > This would have the advantage that gcc developers who don't
> > have an autofdo setup (e.g. missing tools or running in virtualization
> > with PMU disabled) would still do most of the regression tests.
> 
> Yes, I think the pros outweight the cons here.  Well, at least if
> generating such data that works on multiple archs is even possible?

The gcov data that comes out of autofdo is architecture independent
as far as I know. It's mainly counts per line.

In fact even the perf input data should be fairly architecture
independent (except perhaps for endian)

I think it would need a way to write gcov data using text input
(unless you want to put a lot of binaries into the repository)

Also it would need to be adjusted every time a line number
changes in the test cases. I guess best would be if dejagnu
could somehow generate it from test case comments, but I don't
know how complicated that would be.

Doing such updates would be likely difficult with binaries.

In the future if we ever re-add discriminator support
again it would also need some way to specify the correct
discriminator.

I guess for simple test cases it could be ensured it is
always 0.

-Andi


Re: [PATCH AutoFDO]Restoring indirect call value profile transformation

2018-12-19 Thread Richard Biener
On Wed, Dec 19, 2018 at 4:41 PM Andi Kleen  wrote:
>
> > > We can combine the two together, increasing iteration count and
> > > decreasing perf count at the same time.  What count would you suggest
> > > from your experience?
> >
> > Can we instead for the tests where we want to test profile use/merge
> > elide the profiling step and supply the "raw" data in an testsuite alternate
> > file instead?
>
> That would be possible, but a drawback is that we wouldn't have an
> "end2end" test anymore that also tests the interaction with perf
> and autofdo. Would be good to test these cases too, there were regressions
> in this before.

Sure.

> But perhaps splitting that into two separate tests is reasonable,
> with the majority of tests running with fake data.
>
> This would have the advantage that gcc developers who don't
> have an autofdo setup (e.g. missing tools or running in virtualization
> with PMU disabled) would still do most of the regression tests.

Yes, I think the pros outweight the cons here.  Well, at least if
generating such data that works on multiple archs is even possible?

Richard.

> -Andi


Re: [PATCH AutoFDO]Restoring indirect call value profile transformation

2018-12-19 Thread Andi Kleen
On Wed, Dec 19, 2018 at 12:08:35PM +0800, Bin.Cheng wrote:
> On Wed, Dec 19, 2018 at 12:00 PM Andi Kleen  wrote:
> >
> > On Wed, Dec 19, 2018 at 10:01:15AM +0800, Bin.Cheng wrote:
> > > On Tue, Dec 18, 2018 at 7:15 PM Bin.Cheng  wrote:
> > > >
> > > > On Sun, Dec 16, 2018 at 9:11 AM Andi Kleen  wrote:
> > > > >
> > > > > "bin.cheng"  writes:
> > > > >
> > > > > > Hi,
> > > > > >
> > > > > > Due to ICE and mal-functional bugs, indirect call value profile 
> > > > > > transformation
> > > > > > is disabled on GCC-7/8/trunk.  This patch restores the 
> > > > > > transformation.  The
> > > > > > main issue is AutoFDO should store cgraph_node's profile_id of 
> > > > > > callee func in
> > > > > > the first histogram value's counter, rather than pointer to 
> > > > > > callee's name string
> > > > > > as it is now.
> > > > > > With the patch, some "Indirect call -> direct call" tests pass with 
> > > > > > autofdo, while
> > > > > > others are unstable.  I think the instability is caused by poor 
> > > > > > perf data collected
> > > > > > during regrets run, and can confirm these tests pass if good perf 
> > > > > > data could be
> > > > > > collected in manual experiments.
> > > > >
> > > > > Would be good to make the tests stable, otherwise we'll just have
> > > > > regressions in the future again.
> > > > >
> > > > > The problem is that the tests don't run long enough and don't get 
> > > > > enough samples?
> > > > Yes, take g++.dg/tree-prof/morefunc.C as an example:
> > > > -  int i;
> > > > -  for (i = 0; i < 1000; i++)
> > > > +  int i, j;
> > > > +  for (i = 0; i < 100; i++)
> > > > +for (j = 0; j < 50; j++)
> > > >   g += tc->foo();
> > > > if (g<100) g++;
> > > >  }
> > > > @@ -27,8 +28,9 @@ void test1 (A *tc)
> > > >  static __attribute__((always_inline))
> > > >  void test2 (B *tc)
> > > >  {
> > > > -  int i;
> > > > +  int i, j;
> > > >for (i = 0; i < 100; i++)
> > > > +for (j = 0; j < 50; j++)
> > > >
> > > > I have to increase loop count like this to get stable pass on my
> > > > machine.  The original count (1000) is too small to be sampled.
> > > >
> > > > >
> > > > > Could add some loop?
> > > > > Or possibly increase the sampling frequency in perf (-F or -c)?
> > > > Maybe, I will have a try.
> > > Turned out all "Indirect call" test can be resolved by adding -c 100
> > > to perf command line:
> > > diff --git a/gcc/config/i386/gcc-auto-profile 
> > > b/gcc/config/i386/gcc-auto-profile
> > > ...
> > > -exec perf record -e $E -b "$@"
> > > +exec perf record -e $E -c 100 -b "$@"
> > >
> > > Is 100 too small here?  Or is it fine for all scenarios?
> >
> > -c 100 is risky because it can cause perf throttling, which
> > makes it lose data.
> Right, it looks suspicious to me too.
> 
> >
> > perf has a limiter that if the PMU handler uses too much CPU
> > time it stops measuring for some time. A PMI is 10k+ cycles,
> > so doing one every 100 branches is a lot of CPU time.
> >
> > I wouldn't go down that low. It is better to increase the
> > iteration count.
> We can combine the two together, increasing iteration count and
> decreasing perf count at the same time.  What count would you suggest
> from your experience?

Normally nothing less than 50k for a common event like branches.

But for such a limited test 10k might still work, as long
as the runtime is fair controlled.

We would probably need to ensure at least 10+ samples, so
that would be 100k iterations.

iirc that was what we used originally, until people
complained about the simulator run times.

-Andi


Re: [PATCH AutoFDO]Restoring indirect call value profile transformation

2018-12-19 Thread Andi Kleen
> > We can combine the two together, increasing iteration count and
> > decreasing perf count at the same time.  What count would you suggest
> > from your experience?
> 
> Can we instead for the tests where we want to test profile use/merge
> elide the profiling step and supply the "raw" data in an testsuite alternate
> file instead?

That would be possible, but a drawback is that we wouldn't have an
"end2end" test anymore that also tests the interaction with perf
and autofdo. Would be good to test these cases too, there were regressions
in this before.

But perhaps splitting that into two separate tests is reasonable,
with the majority of tests running with fake data.

This would have the advantage that gcc developers who don't
have an autofdo setup (e.g. missing tools or running in virtualization
with PMU disabled) would still do most of the regression tests.

-Andi


Re: [PATCH AutoFDO]Restoring indirect call value profile transformation

2018-12-19 Thread Richard Biener
On Wed, Dec 19, 2018 at 5:08 AM Bin.Cheng  wrote:
>
> On Wed, Dec 19, 2018 at 12:00 PM Andi Kleen  wrote:
> >
> > On Wed, Dec 19, 2018 at 10:01:15AM +0800, Bin.Cheng wrote:
> > > On Tue, Dec 18, 2018 at 7:15 PM Bin.Cheng  wrote:
> > > >
> > > > On Sun, Dec 16, 2018 at 9:11 AM Andi Kleen  wrote:
> > > > >
> > > > > "bin.cheng"  writes:
> > > > >
> > > > > > Hi,
> > > > > >
> > > > > > Due to ICE and mal-functional bugs, indirect call value profile 
> > > > > > transformation
> > > > > > is disabled on GCC-7/8/trunk.  This patch restores the 
> > > > > > transformation.  The
> > > > > > main issue is AutoFDO should store cgraph_node's profile_id of 
> > > > > > callee func in
> > > > > > the first histogram value's counter, rather than pointer to 
> > > > > > callee's name string
> > > > > > as it is now.
> > > > > > With the patch, some "Indirect call -> direct call" tests pass with 
> > > > > > autofdo, while
> > > > > > others are unstable.  I think the instability is caused by poor 
> > > > > > perf data collected
> > > > > > during regrets run, and can confirm these tests pass if good perf 
> > > > > > data could be
> > > > > > collected in manual experiments.
> > > > >
> > > > > Would be good to make the tests stable, otherwise we'll just have
> > > > > regressions in the future again.
> > > > >
> > > > > The problem is that the tests don't run long enough and don't get 
> > > > > enough samples?
> > > > Yes, take g++.dg/tree-prof/morefunc.C as an example:
> > > > -  int i;
> > > > -  for (i = 0; i < 1000; i++)
> > > > +  int i, j;
> > > > +  for (i = 0; i < 100; i++)
> > > > +for (j = 0; j < 50; j++)
> > > >   g += tc->foo();
> > > > if (g<100) g++;
> > > >  }
> > > > @@ -27,8 +28,9 @@ void test1 (A *tc)
> > > >  static __attribute__((always_inline))
> > > >  void test2 (B *tc)
> > > >  {
> > > > -  int i;
> > > > +  int i, j;
> > > >for (i = 0; i < 100; i++)
> > > > +for (j = 0; j < 50; j++)
> > > >
> > > > I have to increase loop count like this to get stable pass on my
> > > > machine.  The original count (1000) is too small to be sampled.
> > > >
> > > > >
> > > > > Could add some loop?
> > > > > Or possibly increase the sampling frequency in perf (-F or -c)?
> > > > Maybe, I will have a try.
> > > Turned out all "Indirect call" test can be resolved by adding -c 100
> > > to perf command line:
> > > diff --git a/gcc/config/i386/gcc-auto-profile 
> > > b/gcc/config/i386/gcc-auto-profile
> > > ...
> > > -exec perf record -e $E -b "$@"
> > > +exec perf record -e $E -c 100 -b "$@"
> > >
> > > Is 100 too small here?  Or is it fine for all scenarios?
> >
> > -c 100 is risky because it can cause perf throttling, which
> > makes it lose data.
> Right, it looks suspicious to me too.
>
> >
> > perf has a limiter that if the PMU handler uses too much CPU
> > time it stops measuring for some time. A PMI is 10k+ cycles,
> > so doing one every 100 branches is a lot of CPU time.
> >
> > I wouldn't go down that low. It is better to increase the
> > iteration count.
> We can combine the two together, increasing iteration count and
> decreasing perf count at the same time.  What count would you suggest
> from your experience?

Can we instead for the tests where we want to test profile use/merge
elide the profiling step and supply the "raw" data in an testsuite alternate
file instead?

Richard.

> Thanks,
> bin
> >
> > -Andi


Re: [PATCH AutoFDO]Restoring indirect call value profile transformation

2018-12-18 Thread Bin.Cheng
On Wed, Dec 19, 2018 at 12:00 PM Andi Kleen  wrote:
>
> On Wed, Dec 19, 2018 at 10:01:15AM +0800, Bin.Cheng wrote:
> > On Tue, Dec 18, 2018 at 7:15 PM Bin.Cheng  wrote:
> > >
> > > On Sun, Dec 16, 2018 at 9:11 AM Andi Kleen  wrote:
> > > >
> > > > "bin.cheng"  writes:
> > > >
> > > > > Hi,
> > > > >
> > > > > Due to ICE and mal-functional bugs, indirect call value profile 
> > > > > transformation
> > > > > is disabled on GCC-7/8/trunk.  This patch restores the 
> > > > > transformation.  The
> > > > > main issue is AutoFDO should store cgraph_node's profile_id of callee 
> > > > > func in
> > > > > the first histogram value's counter, rather than pointer to callee's 
> > > > > name string
> > > > > as it is now.
> > > > > With the patch, some "Indirect call -> direct call" tests pass with 
> > > > > autofdo, while
> > > > > others are unstable.  I think the instability is caused by poor perf 
> > > > > data collected
> > > > > during regrets run, and can confirm these tests pass if good perf 
> > > > > data could be
> > > > > collected in manual experiments.
> > > >
> > > > Would be good to make the tests stable, otherwise we'll just have
> > > > regressions in the future again.
> > > >
> > > > The problem is that the tests don't run long enough and don't get 
> > > > enough samples?
> > > Yes, take g++.dg/tree-prof/morefunc.C as an example:
> > > -  int i;
> > > -  for (i = 0; i < 1000; i++)
> > > +  int i, j;
> > > +  for (i = 0; i < 100; i++)
> > > +for (j = 0; j < 50; j++)
> > >   g += tc->foo();
> > > if (g<100) g++;
> > >  }
> > > @@ -27,8 +28,9 @@ void test1 (A *tc)
> > >  static __attribute__((always_inline))
> > >  void test2 (B *tc)
> > >  {
> > > -  int i;
> > > +  int i, j;
> > >for (i = 0; i < 100; i++)
> > > +for (j = 0; j < 50; j++)
> > >
> > > I have to increase loop count like this to get stable pass on my
> > > machine.  The original count (1000) is too small to be sampled.
> > >
> > > >
> > > > Could add some loop?
> > > > Or possibly increase the sampling frequency in perf (-F or -c)?
> > > Maybe, I will have a try.
> > Turned out all "Indirect call" test can be resolved by adding -c 100
> > to perf command line:
> > diff --git a/gcc/config/i386/gcc-auto-profile 
> > b/gcc/config/i386/gcc-auto-profile
> > ...
> > -exec perf record -e $E -b "$@"
> > +exec perf record -e $E -c 100 -b "$@"
> >
> > Is 100 too small here?  Or is it fine for all scenarios?
>
> -c 100 is risky because it can cause perf throttling, which
> makes it lose data.
Right, it looks suspicious to me too.

>
> perf has a limiter that if the PMU handler uses too much CPU
> time it stops measuring for some time. A PMI is 10k+ cycles,
> so doing one every 100 branches is a lot of CPU time.
>
> I wouldn't go down that low. It is better to increase the
> iteration count.
We can combine the two together, increasing iteration count and
decreasing perf count at the same time.  What count would you suggest
from your experience?

Thanks,
bin
>
> -Andi


Re: [PATCH AutoFDO]Restoring indirect call value profile transformation

2018-12-18 Thread Andi Kleen
On Wed, Dec 19, 2018 at 10:01:15AM +0800, Bin.Cheng wrote:
> On Tue, Dec 18, 2018 at 7:15 PM Bin.Cheng  wrote:
> >
> > On Sun, Dec 16, 2018 at 9:11 AM Andi Kleen  wrote:
> > >
> > > "bin.cheng"  writes:
> > >
> > > > Hi,
> > > >
> > > > Due to ICE and mal-functional bugs, indirect call value profile 
> > > > transformation
> > > > is disabled on GCC-7/8/trunk.  This patch restores the transformation.  
> > > > The
> > > > main issue is AutoFDO should store cgraph_node's profile_id of callee 
> > > > func in
> > > > the first histogram value's counter, rather than pointer to callee's 
> > > > name string
> > > > as it is now.
> > > > With the patch, some "Indirect call -> direct call" tests pass with 
> > > > autofdo, while
> > > > others are unstable.  I think the instability is caused by poor perf 
> > > > data collected
> > > > during regrets run, and can confirm these tests pass if good perf data 
> > > > could be
> > > > collected in manual experiments.
> > >
> > > Would be good to make the tests stable, otherwise we'll just have
> > > regressions in the future again.
> > >
> > > The problem is that the tests don't run long enough and don't get enough 
> > > samples?
> > Yes, take g++.dg/tree-prof/morefunc.C as an example:
> > -  int i;
> > -  for (i = 0; i < 1000; i++)
> > +  int i, j;
> > +  for (i = 0; i < 100; i++)
> > +for (j = 0; j < 50; j++)
> >   g += tc->foo();
> > if (g<100) g++;
> >  }
> > @@ -27,8 +28,9 @@ void test1 (A *tc)
> >  static __attribute__((always_inline))
> >  void test2 (B *tc)
> >  {
> > -  int i;
> > +  int i, j;
> >for (i = 0; i < 100; i++)
> > +for (j = 0; j < 50; j++)
> >
> > I have to increase loop count like this to get stable pass on my
> > machine.  The original count (1000) is too small to be sampled.
> >
> > >
> > > Could add some loop?
> > > Or possibly increase the sampling frequency in perf (-F or -c)?
> > Maybe, I will have a try.
> Turned out all "Indirect call" test can be resolved by adding -c 100
> to perf command line:
> diff --git a/gcc/config/i386/gcc-auto-profile 
> b/gcc/config/i386/gcc-auto-profile
> ...
> -exec perf record -e $E -b "$@"
> +exec perf record -e $E -c 100 -b "$@"
> 
> Is 100 too small here?  Or is it fine for all scenarios?

-c 100 is risky because it can cause perf throttling, which
makes it lose data.

perf has a limiter that if the PMU handler uses too much CPU
time it stops measuring for some time. A PMI is 10k+ cycles,
so doing one every 100 branches is a lot of CPU time.

I wouldn't go down that low. It is better to increase the
iteration count.

-Andi


Re: [PATCH AutoFDO]Restoring indirect call value profile transformation

2018-12-18 Thread Andi Kleen
On Wed, Dec 19, 2018 at 09:26:51AM +0800, Bin.Cheng wrote:
> On Wed, Dec 19, 2018 at 5:27 AM Andi Kleen  wrote:
> >
> > > Yes, take g++.dg/tree-prof/morefunc.C as an example:
> > > -  int i;
> > > -  for (i = 0; i < 1000; i++)
> > > +  int i, j;
> > > +  for (i = 0; i < 100; i++)
> > > +for (j = 0; j < 50; j++)
> > >   g += tc->foo();
> > > if (g<100) g++;
> > >  }
> > > @@ -27,8 +28,9 @@ void test1 (A *tc)
> > >  static __attribute__((always_inline))
> > >  void test2 (B *tc)
> > >  {
> > > -  int i;
> > > +  int i, j;
> > >for (i = 0; i < 100; i++)
> > > +for (j = 0; j < 50; j++)
> > >
> > > I have to increase loop count like this to get stable pass on my
> > > machine.  The original count (1000) is too small to be sampled.
> >
> > IIRC It was originally higher, but people running on slow simulators 
> > complained,
> > so it was reduced.  Perhaps we need some way to detect in the test suite
> > that the test runs on a real CPU.
> Is there concise way to do this, given gcc may be run on all kinds of
> virtual scenarios?

Virtual should be fine too, just simulators are too slow.

I hope there is, because we certainly need a solution for production
ready autofdo.

Or perhaps could just check if perf is working and only
run the tests if that is true. The TCL code already
checks that. Just would need to pass that information
somehow as a define.

Overall I suspect far more test coverage is needed
to make it solid. The existing tests are not that great.


> 
> >
> > >
> > > > > FYI, an update about AutoFDO status:
> > > > > All AutoFDO ICEs in regtest are fixed, while several tests still 
> > > > > failing fall in below
> > > > > three categories:
> > > >
> > > > Great!
> > > >
> > > > Of course it still ICEs with LTO?
> > > >
> > > > Right now there is no test case for this I think. Probably one should 
> > > > be added.
> >
> >
> > Any comments on this?
> We'd like to further investigate AutoFDO+LTO, may I ask what the
> status is (or was)?  Any background elaboration about this would be
> appreciated.

It just never worked and ICEs very quickly if you try it.  

There's an open PR (PR71672)

There are some other open issues with autofdo BTW, e.g. the
old 4.9 google branch still has more features than mainline.
For example it supported discriminators, so can distinguish more
than one basic block per source line.

The last time I tested the gains with mainline autofdo
were also significantly less than 4.9-google, so there might
be other tunings missing.

-Andi



Re: [PATCH AutoFDO]Restoring indirect call value profile transformation

2018-12-18 Thread Bin.Cheng
On Tue, Dec 18, 2018 at 7:15 PM Bin.Cheng  wrote:
>
> On Sun, Dec 16, 2018 at 9:11 AM Andi Kleen  wrote:
> >
> > "bin.cheng"  writes:
> >
> > > Hi,
> > >
> > > Due to ICE and mal-functional bugs, indirect call value profile 
> > > transformation
> > > is disabled on GCC-7/8/trunk.  This patch restores the transformation.  
> > > The
> > > main issue is AutoFDO should store cgraph_node's profile_id of callee 
> > > func in
> > > the first histogram value's counter, rather than pointer to callee's name 
> > > string
> > > as it is now.
> > > With the patch, some "Indirect call -> direct call" tests pass with 
> > > autofdo, while
> > > others are unstable.  I think the instability is caused by poor perf data 
> > > collected
> > > during regrets run, and can confirm these tests pass if good perf data 
> > > could be
> > > collected in manual experiments.
> >
> > Would be good to make the tests stable, otherwise we'll just have
> > regressions in the future again.
> >
> > The problem is that the tests don't run long enough and don't get enough 
> > samples?
> Yes, take g++.dg/tree-prof/morefunc.C as an example:
> -  int i;
> -  for (i = 0; i < 1000; i++)
> +  int i, j;
> +  for (i = 0; i < 100; i++)
> +for (j = 0; j < 50; j++)
>   g += tc->foo();
> if (g<100) g++;
>  }
> @@ -27,8 +28,9 @@ void test1 (A *tc)
>  static __attribute__((always_inline))
>  void test2 (B *tc)
>  {
> -  int i;
> +  int i, j;
>for (i = 0; i < 100; i++)
> +for (j = 0; j < 50; j++)
>
> I have to increase loop count like this to get stable pass on my
> machine.  The original count (1000) is too small to be sampled.
>
> >
> > Could add some loop?
> > Or possibly increase the sampling frequency in perf (-F or -c)?
> Maybe, I will have a try.
Turned out all "Indirect call" test can be resolved by adding -c 100
to perf command line:
diff --git a/gcc/config/i386/gcc-auto-profile b/gcc/config/i386/gcc-auto-profile
...
-exec perf record -e $E -b "$@"
+exec perf record -e $E -c 100 -b "$@"

Is 100 too small here?  Or is it fine for all scenarios?

Thanks,
bin

> > Or run them multiple times and use gcov_merge to merge the files?
> Without changing loop count or sampling frequency, this is not likely
> to be helpful, since perf doesn't hit the small loop in most cases.
>
> Thanks,
> bin
> >
> >
> > > FYI, an update about AutoFDO status:
> > > All AutoFDO ICEs in regtest are fixed, while several tests still failing 
> > > fall in below
> > > three categories:
> >
> > Great!
> >
> > Of course it still ICEs with LTO?
> >
> > Right now there is no test case for this I think. Probably one should be 
> > added.
> >
> > -Andi


Re: [PATCH AutoFDO]Restoring indirect call value profile transformation

2018-12-18 Thread Bin.Cheng
On Wed, Dec 19, 2018 at 5:27 AM Andi Kleen  wrote:
>
> > Yes, take g++.dg/tree-prof/morefunc.C as an example:
> > -  int i;
> > -  for (i = 0; i < 1000; i++)
> > +  int i, j;
> > +  for (i = 0; i < 100; i++)
> > +for (j = 0; j < 50; j++)
> >   g += tc->foo();
> > if (g<100) g++;
> >  }
> > @@ -27,8 +28,9 @@ void test1 (A *tc)
> >  static __attribute__((always_inline))
> >  void test2 (B *tc)
> >  {
> > -  int i;
> > +  int i, j;
> >for (i = 0; i < 100; i++)
> > +for (j = 0; j < 50; j++)
> >
> > I have to increase loop count like this to get stable pass on my
> > machine.  The original count (1000) is too small to be sampled.
>
> IIRC It was originally higher, but people running on slow simulators 
> complained,
> so it was reduced.  Perhaps we need some way to detect in the test suite
> that the test runs on a real CPU.
Is there concise way to do this, given gcc may be run on all kinds of
virtual scenarios?

>
> >
> > > > FYI, an update about AutoFDO status:
> > > > All AutoFDO ICEs in regtest are fixed, while several tests still 
> > > > failing fall in below
> > > > three categories:
> > >
> > > Great!
> > >
> > > Of course it still ICEs with LTO?
> > >
> > > Right now there is no test case for this I think. Probably one should be 
> > > added.
>
>
> Any comments on this?
We'd like to further investigate AutoFDO+LTO, may I ask what the
status is (or was)?  Any background elaboration about this would be
appreciated.

Thanks,
bin
>
> -Andi


Re: [PATCH AutoFDO]Restoring indirect call value profile transformation

2018-12-18 Thread Andi Kleen
> Yes, take g++.dg/tree-prof/morefunc.C as an example:
> -  int i;
> -  for (i = 0; i < 1000; i++)
> +  int i, j;
> +  for (i = 0; i < 100; i++)
> +for (j = 0; j < 50; j++)
>   g += tc->foo();
> if (g<100) g++;
>  }
> @@ -27,8 +28,9 @@ void test1 (A *tc)
>  static __attribute__((always_inline))
>  void test2 (B *tc)
>  {
> -  int i;
> +  int i, j;
>for (i = 0; i < 100; i++)
> +for (j = 0; j < 50; j++)
> 
> I have to increase loop count like this to get stable pass on my
> machine.  The original count (1000) is too small to be sampled.

IIRC It was originally higher, but people running on slow simulators complained,
so it was reduced.  Perhaps we need some way to detect in the test suite
that the test runs on a real CPU.

> 
> > > FYI, an update about AutoFDO status:
> > > All AutoFDO ICEs in regtest are fixed, while several tests still failing 
> > > fall in below
> > > three categories:
> >
> > Great!
> >
> > Of course it still ICEs with LTO?
> >
> > Right now there is no test case for this I think. Probably one should be 
> > added.


Any comments on this?

-Andi


Re: [PATCH AutoFDO]Restoring indirect call value profile transformation

2018-12-18 Thread Bin.Cheng
On Sun, Dec 16, 2018 at 9:11 AM Andi Kleen  wrote:
>
> "bin.cheng"  writes:
>
> > Hi,
> >
> > Due to ICE and mal-functional bugs, indirect call value profile 
> > transformation
> > is disabled on GCC-7/8/trunk.  This patch restores the transformation.  The
> > main issue is AutoFDO should store cgraph_node's profile_id of callee func 
> > in
> > the first histogram value's counter, rather than pointer to callee's name 
> > string
> > as it is now.
> > With the patch, some "Indirect call -> direct call" tests pass with 
> > autofdo, while
> > others are unstable.  I think the instability is caused by poor perf data 
> > collected
> > during regrets run, and can confirm these tests pass if good perf data 
> > could be
> > collected in manual experiments.
>
> Would be good to make the tests stable, otherwise we'll just have
> regressions in the future again.
>
> The problem is that the tests don't run long enough and don't get enough 
> samples?
Yes, take g++.dg/tree-prof/morefunc.C as an example:
-  int i;
-  for (i = 0; i < 1000; i++)
+  int i, j;
+  for (i = 0; i < 100; i++)
+for (j = 0; j < 50; j++)
  g += tc->foo();
if (g<100) g++;
 }
@@ -27,8 +28,9 @@ void test1 (A *tc)
 static __attribute__((always_inline))
 void test2 (B *tc)
 {
-  int i;
+  int i, j;
   for (i = 0; i < 100; i++)
+for (j = 0; j < 50; j++)

I have to increase loop count like this to get stable pass on my
machine.  The original count (1000) is too small to be sampled.

>
> Could add some loop?
> Or possibly increase the sampling frequency in perf (-F or -c)?
Maybe, I will have a try.
> Or run them multiple times and use gcov_merge to merge the files?
Without changing loop count or sampling frequency, this is not likely
to be helpful, since perf doesn't hit the small loop in most cases.

Thanks,
bin
>
>
> > FYI, an update about AutoFDO status:
> > All AutoFDO ICEs in regtest are fixed, while several tests still failing 
> > fall in below
> > three categories:
>
> Great!
>
> Of course it still ICEs with LTO?
>
> Right now there is no test case for this I think. Probably one should be 
> added.
>
> -Andi


Re: [PATCH AutoFDO]Restoring indirect call value profile transformation

2018-12-15 Thread Andi Kleen
"bin.cheng"  writes:

> Hi,
>
> Due to ICE and mal-functional bugs, indirect call value profile transformation
> is disabled on GCC-7/8/trunk.  This patch restores the transformation.  The
> main issue is AutoFDO should store cgraph_node's profile_id of callee func in
> the first histogram value's counter, rather than pointer to callee's name 
> string
> as it is now.
> With the patch, some "Indirect call -> direct call" tests pass with autofdo, 
> while
> others are unstable.  I think the instability is caused by poor perf data 
> collected
> during regrets run, and can confirm these tests pass if good perf data could 
> be
> collected in manual experiments.

Would be good to make the tests stable, otherwise we'll just have
regressions in the future again.

The problem is that the tests don't run long enough and don't get enough 
samples?

Could add some loop?
Or possibly increase the sampling frequency in perf (-F or -c)?
Or run them multiple times and use gcov_merge to merge the files?


> FYI, an update about AutoFDO status: 
> All AutoFDO ICEs in regtest are fixed, while several tests still failing fall 
> in below
> three categories:

Great!

Of course it still ICEs with LTO?

Right now there is no test case for this I think. Probably one should be added.

-Andi


Re: [PATCH AutoFDO]Restoring indirect call value profile transformation

2018-12-13 Thread Jeff Law
On 12/12/18 8:50 PM, bin.cheng wrote:
> Hi,
> 
> Due to ICE and mal-functional bugs, indirect call value profile transformation
> is disabled on GCC-7/8/trunk.  This patch restores the transformation.  The
> main issue is AutoFDO should store cgraph_node's profile_id of callee func in
> the first histogram value's counter, rather than pointer to callee's name 
> string
> as it is now.
> With the patch, some "Indirect call -> direct call" tests pass with autofdo, 
> while
> others are unstable.  I think the instability is caused by poor perf data 
> collected
> during regrets run, and can confirm these tests pass if good perf data could 
> be
> collected in manual experiments.
> 
> Bootstrap and test along with previous patches.  Is it OK?
> 
> FYI, an update about AutoFDO status: 
> All AutoFDO ICEs in regtest are fixed, while several tests still failing fall 
> in below
> three categories:
> 
> Unstable indirect call value profile transformation:
> FAIL: g++.dg/tree-prof/indir-call-prof.C scan-ipa-dump afdo "Indirect call -> 
> direct call.* AA transformation on insn"
> FAIL: g++.dg/tree-prof/morefunc.C scan-ipa-dump-times afdo "Indirect call -> 
> direct call" 2
> FAIL: g++.dg/tree-prof/pr35545.C scan-ipa-dump profile_estimate "Indirect 
> call -> direct call"
> 
> loop peeling case because we don't honor autofdo profile count as reliable:
> FAIL: gcc.dg/tree-prof/peel-1.c scan-tree-dump cunroll "Peeled loop ., 1 
> times"
> 
> cold/hot partition cases:
> FAIL: gcc.dg/tree-prof/cold_partition_label.c scan-assembler foo[._]+cold
> FAIL: gcc.dg/tree-prof/cold_partition_label.c scan-assembler size[ 
> \ta-zA-Z0-0]+foo[._]+cold
> FAIL: gcc.dg/tree-prof/section-attr-1.c scan-assembler .section[\t 
> ]*.text.unlikely[\\n\\r]+[\t ]*.size[\t ]*foo.cold
> FAIL: gcc.dg/tree-prof/section-attr-2.c scan-assembler .section[\t 
> ]*.text.unlikely[\\n\\r]+[\t ]*.size[\t ]*foo.cold
> FAIL: gcc.dg/tree-prof/section-attr-3.c scan-assembler .section[\t 
> ]*.text.unlikely[\\n\\r]+[\t ]*.size[\t ]*foo.cold
> These are more difficult to enable because we can't simply treat autofdo::zero
> count as cold, it's just too many.
> 
> Besides regtest, I run autofdo with kernel/mysql-server, the build and 
> performance
> match expectations now, but I haven't run autofdo with any spec yet.
> 
> Thanks,
> bin
> 
> 2018-12-13  Bin Cheng  
> 
> * auto-profile.c (afdo_indirect_call): Skip generating histogram
> value if we can't find cgraph_node for then indirected callee.  Save
> profile_id of the cgraph_node in histogram value's first counter.
> * value-prof.c (gimple_value_profile_transformations): Don't skip
> for flag_auto_profile.
OK
jeff




[PATCH AutoFDO]Restoring indirect call value profile transformation

2018-12-12 Thread bin.cheng
Hi,

Due to ICE and mal-functional bugs, indirect call value profile transformation
is disabled on GCC-7/8/trunk.  This patch restores the transformation.  The
main issue is AutoFDO should store cgraph_node's profile_id of callee func in
the first histogram value's counter, rather than pointer to callee's name string
as it is now.
With the patch, some "Indirect call -> direct call" tests pass with autofdo, 
while
others are unstable.  I think the instability is caused by poor perf data 
collected
during regrets run, and can confirm these tests pass if good perf data could be
collected in manual experiments.

Bootstrap and test along with previous patches.  Is it OK?

FYI, an update about AutoFDO status: 
All AutoFDO ICEs in regtest are fixed, while several tests still failing fall 
in below
three categories:

Unstable indirect call value profile transformation:
FAIL: g++.dg/tree-prof/indir-call-prof.C scan-ipa-dump afdo "Indirect call -> 
direct call.* AA transformation on insn"
FAIL: g++.dg/tree-prof/morefunc.C scan-ipa-dump-times afdo "Indirect call -> 
direct call" 2
FAIL: g++.dg/tree-prof/pr35545.C scan-ipa-dump profile_estimate "Indirect call 
-> direct call"

loop peeling case because we don't honor autofdo profile count as reliable:
FAIL: gcc.dg/tree-prof/peel-1.c scan-tree-dump cunroll "Peeled loop ., 1 times"

cold/hot partition cases:
FAIL: gcc.dg/tree-prof/cold_partition_label.c scan-assembler foo[._]+cold
FAIL: gcc.dg/tree-prof/cold_partition_label.c scan-assembler size[ 
\ta-zA-Z0-0]+foo[._]+cold
FAIL: gcc.dg/tree-prof/section-attr-1.c scan-assembler .section[\t 
]*.text.unlikely[\\n\\r]+[\t ]*.size[\t ]*foo.cold
FAIL: gcc.dg/tree-prof/section-attr-2.c scan-assembler .section[\t 
]*.text.unlikely[\\n\\r]+[\t ]*.size[\t ]*foo.cold
FAIL: gcc.dg/tree-prof/section-attr-3.c scan-assembler .section[\t 
]*.text.unlikely[\\n\\r]+[\t ]*.size[\t ]*foo.cold
These are more difficult to enable because we can't simply treat autofdo::zero
count as cold, it's just too many.

Besides regtest, I run autofdo with kernel/mysql-server, the build and 
performance
match expectations now, but I haven't run autofdo with any spec yet.

Thanks,
bin

2018-12-13  Bin Cheng  

* auto-profile.c (afdo_indirect_call): Skip generating histogram
value if we can't find cgraph_node for then indirected callee.  Save
profile_id of the cgraph_node in histogram value's first counter.
* value-prof.c (gimple_value_profile_transformations): Don't skip
for flag_auto_profile.

0007-Fix-autofdo-indirect-call-value-prof-transformation.patch
Description: Binary data