Re: [PATCH PR79347]Maintain profile counter information in vect_do_peeling

2017-02-15 Thread Jan Hubicka
> On Tue, Feb 14, 2017 at 2:13 PM, Bin.Cheng  wrote:
> > On Tue, Feb 14, 2017 at 1:57 PM, Jan Hubicka  wrote:
> >>> Thanks,
> >>> bin
> >>> 2017-02-13  Bin Cheng  
> >>>
> >>>   PR tree-optimization/79347
> >>>   * tree-vect-loop-manip.c (apply_probability_for_bb): New function.
> >>>   (vect_do_peeling): Maintain profile counters during peeling.
> >>>
> >>> gcc/testsuite/ChangeLog
> >>> 2017-02-13  Bin Cheng  
> >>>
> >>>   PR tree-optimization/79347
> >>>   * gcc.dg/vect/pr79347.c: New test.
> >>
> >>> diff --git a/gcc/testsuite/gcc.dg/vect/pr79347.c 
> >>> b/gcc/testsuite/gcc.dg/vect/pr79347.c
> >>> new file mode 100644
> >>> index 000..586c638
> >>> --- /dev/null
> >>> +++ b/gcc/testsuite/gcc.dg/vect/pr79347.c
> >>> @@ -0,0 +1,13 @@
> >>> +/* { dg-do compile } */
> >>> +/* { dg-require-effective-target vect_int } */
> >>> +/* { dg-additional-options "-fdump-tree-vect-all" } */
> >>> +
> >>> +short *a;
> >>> +int c;
> >>> +void n(void)
> >>> +{
> >>> +  for (int i = 0; i < c; i++)
> >>> +    a[i]++;
> >>> +}
> >>
> >> Thanks for fixing the prologue.  I think there is still one extra problem 
> >> in the vectorizer.
> >> With the internal vectorized loop I now see:
> >>
> >> ;;   basic block 9, loop depth 1, count 0, freq 956, maybe hot
> >> ;;   Invalid sum of incoming frequencies 1961, should be 956
> >> ;;prev block 8, next block 10, flags: (NEW, REACHABLE, VISITED)
> >> ;;pred:   10 [100.0%]  (FALLTHRU,DFS_BACK,EXECUTABLE)
> >> ;;8 [100.0%]  (FALLTHRU)
> >>   # i_18 = PHI 
> >>   # vectp_a.13_66 = PHI 
> >>   # vectp_a.19_75 = PHI 
> >>   # ivtmp_78 = PHI 
> >>   _2 = (long unsigned int) i_18;
> >>   _3 = _2 * 2;
> >>   _4 = a.0_1 + _3;
> >>   vect__5.15_68 = MEM[(short int *)vectp_a.13_66];
> >>   _5 = *_4;
> >>   vect__6.16_69 = VIEW_CONVERT_EXPR >> short>(vect__5.15_68);
> >>   _6 = (unsigned short) _5;
> >>   vect__7.17_71 = vect__6.16_69 + vect_cst__70;
> >>   _7 = _6 + 1;
> >>   vect__8.18_72 = VIEW_CONVERT_EXPR(vect__7.17_71);
> >>   _8 = (short int) _7;
> >>   MEM[(short int *)vectp_a.19_75] = vect__8.18_72;
> >>   i_14 = i_18 + 1;
> >>   vectp_a.13_67 = vectp_a.13_66 + 16;
> >>   vectp_a.19_76 = vectp_a.19_75 + 16;
> >>   ivtmp_79 = ivtmp_78 + 1;
> >>   if (ivtmp_79 < bnd.10_59)
> >> goto ; [85.00%]
> >>   else
> >> goto ; [15.00%]
> >>
> >> So it seems that the frequency of the loop itself is unrealistically
> >> scaled down.  Before vectorizing, the frequency is 8500 and the
> >> predicted number of iterations is 6.6.  Now the loop is entered via
> >> BB 8 with frequency 1148, so by exit probability the loop exits with
> >> 15% probability and thus still has 6.6 iterations, but by BB
> >> frequencies its body executes fewer times than the preheader.
> >>
> >> Now this is a fragile area: vectorizing the loop should scale the
> >> number of iterations down 8 times.  However, guessed CFG profiles are
> >> always very "flat".  Of course, if the loop iterated 6.6 times on
> >> average, vectorizing would not make any sense.  Making guessed
> >> profiles less flat is unrealistic, because the average loop iterates
> >> only a few times, but while vectorizing we make the additional guess
> >> that the vectorizable loop matters and that the guessed profile is
> >> probably unrealistic.
> > That's what I mentioned in the original patch.  The vectorizer calls
> > scale_loop_profile in vect_transform_loop to scale down the loop's
> > frequency regardless of the mismatch between the loop and its
> > preheader/exit basic blocks.  In fact, after this patch all
> > mismatches in the vectorizer are introduced by this call.  I don't
> > see any way to keep the vectorized loop consistent with the rest of
> > the program without visiting the whole CFG.  So shall we skip scaling
> > down profile counters for the vectorized loop?
> >
> >>
> >> GCC 6, however, seems a bit more consistent.
> >>> +/* Apply probability PROB to basic block BB and its single succ edge.  */
> >>> +
> >>> +static void
> >>> +apply_probability_for_bb (basic_block bb, int prob)
> >>> +{
> >>> +  bb->frequency = apply_probability (bb->frequency, prob);
> >>> +  bb->count = apply_probability (bb->count, prob);
> >>> +  gcc_assert (single_succ_p (bb));
> >>> +  single_succ_edge (bb)->count = bb->count;
> >>> +}
> >>> +
> >>>  /* Function vect_do_peeling.
> >>>
> >>> Input:
> >>> @@ -1690,7 +1701,18 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree 
> >>> niters, tree nitersm1,
> >>>   may be preferred.  */
> >>>basic_block anchor = loop_preheader_edge (loop)->src;
> >>>if (skip_vector)
> >>> -split_edge (loop_preheader_edge (loop));
> >>> +{
> >>> +  split_edge (loop_preheader_edge (loop));
> >>> +
> >>> +  /* Due to the order in which we peel 

Re: [PATCH PR79347]Maintain profile counter information in vect_do_peeling

2017-02-15 Thread Bin.Cheng
On Tue, Feb 14, 2017 at 2:13 PM, Bin.Cheng  wrote:
> On Tue, Feb 14, 2017 at 1:57 PM, Jan Hubicka  wrote:
>>> Thanks,
>>> bin
>>> 2017-02-13  Bin Cheng  
>>>
>>>   PR tree-optimization/79347
>>>   * tree-vect-loop-manip.c (apply_probability_for_bb): New function.
>>>   (vect_do_peeling): Maintain profile counters during peeling.
>>>
>>> gcc/testsuite/ChangeLog
>>> 2017-02-13  Bin Cheng  
>>>
>>>   PR tree-optimization/79347
>>>   * gcc.dg/vect/pr79347.c: New test.
>>
>>> diff --git a/gcc/testsuite/gcc.dg/vect/pr79347.c 
>>> b/gcc/testsuite/gcc.dg/vect/pr79347.c
>>> new file mode 100644
>>> index 000..586c638
>>> --- /dev/null
>>> +++ b/gcc/testsuite/gcc.dg/vect/pr79347.c
>>> @@ -0,0 +1,13 @@
>>> +/* { dg-do compile } */
>>> +/* { dg-require-effective-target vect_int } */
>>> +/* { dg-additional-options "-fdump-tree-vect-all" } */
>>> +
>>> +short *a;
>>> +int c;
>>> +void n(void)
>>> +{
>>> +  for (int i = 0; i < c; i++)
>>> +    a[i]++;
>>> +}
>>
>> Thanks for fixing the prologue.  I think there is still one extra problem in 
>> the vectorizer.
>> With the internal vectorized loop I now see:
>>
>> ;;   basic block 9, loop depth 1, count 0, freq 956, maybe hot
>> ;;   Invalid sum of incoming frequencies 1961, should be 956
>> ;;prev block 8, next block 10, flags: (NEW, REACHABLE, VISITED)
>> ;;pred:   10 [100.0%]  (FALLTHRU,DFS_BACK,EXECUTABLE)
>> ;;8 [100.0%]  (FALLTHRU)
>>   # i_18 = PHI 
>>   # vectp_a.13_66 = PHI 
>>   # vectp_a.19_75 = PHI 
>>   # ivtmp_78 = PHI 
>>   _2 = (long unsigned int) i_18;
>>   _3 = _2 * 2;
>>   _4 = a.0_1 + _3;
>>   vect__5.15_68 = MEM[(short int *)vectp_a.13_66];
>>   _5 = *_4;
>>   vect__6.16_69 = VIEW_CONVERT_EXPR(vect__5.15_68);
>>   _6 = (unsigned short) _5;
>>   vect__7.17_71 = vect__6.16_69 + vect_cst__70;
>>   _7 = _6 + 1;
>>   vect__8.18_72 = VIEW_CONVERT_EXPR(vect__7.17_71);
>>   _8 = (short int) _7;
>>   MEM[(short int *)vectp_a.19_75] = vect__8.18_72;
>>   i_14 = i_18 + 1;
>>   vectp_a.13_67 = vectp_a.13_66 + 16;
>>   vectp_a.19_76 = vectp_a.19_75 + 16;
>>   ivtmp_79 = ivtmp_78 + 1;
>>   if (ivtmp_79 < bnd.10_59)
>> goto ; [85.00%]
>>   else
>> goto ; [15.00%]
>>
>> So it seems that the frequency of the loop itself is unrealistically
>> scaled down.  Before vectorizing, the frequency is 8500 and the
>> predicted number of iterations is 6.6.  Now the loop is entered via
>> BB 8 with frequency 1148, so by exit probability the loop exits with
>> 15% probability and thus still has 6.6 iterations, but by BB
>> frequencies its body executes fewer times than the preheader.
>>
>> Now this is a fragile area: vectorizing the loop should scale the
>> number of iterations down 8 times.  However, guessed CFG profiles are
>> always very "flat".  Of course, if the loop iterated 6.6 times on
>> average, vectorizing would not make any sense.  Making guessed
>> profiles less flat is unrealistic, because the average loop iterates
>> only a few times, but while vectorizing we make the additional guess
>> that the vectorizable loop matters and that the guessed profile is
>> probably unrealistic.
> That's what I mentioned in the original patch.  The vectorizer calls
> scale_loop_profile in vect_transform_loop to scale down the loop's
> frequency regardless of the mismatch between the loop and its
> preheader/exit basic blocks.  In fact, after this patch all
> mismatches in the vectorizer are introduced by this call.  I don't
> see any way to keep the vectorized loop consistent with the rest of
> the program without visiting the whole CFG.  So shall we skip scaling
> down profile counters for the vectorized loop?
>
>>
>> GCC 6, however, seems a bit more consistent.
>>> +/* Apply probability PROB to basic block BB and its single succ edge.  */
>>> +
>>> +static void
>>> +apply_probability_for_bb (basic_block bb, int prob)
>>> +{
>>> +  bb->frequency = apply_probability (bb->frequency, prob);
>>> +  bb->count = apply_probability (bb->count, prob);
>>> +  gcc_assert (single_succ_p (bb));
>>> +  single_succ_edge (bb)->count = bb->count;
>>> +}
>>> +
>>>  /* Function vect_do_peeling.
>>>
>>> Input:
>>> @@ -1690,7 +1701,18 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree 
>>> niters, tree nitersm1,
>>>   may be preferred.  */
>>>basic_block anchor = loop_preheader_edge (loop)->src;
>>>if (skip_vector)
>>> -split_edge (loop_preheader_edge (loop));
>>> +{
>>> +  split_edge (loop_preheader_edge (loop));
>>> +
>>> +  /* Due to the order in which we peel prolog and epilog, we first
>>> +  propagate probability to the whole loop.  The purpose is to
>>> +  avoid adjusting probabilities of both prolog and vector loops
>>> +  separately.  Note in this case, the probability of epilog loop
>>> +  needs to be 

Re: [PATCH PR79347]Maintain profile counter information in vect_do_peeling

2017-02-14 Thread Pat Haugen
On 02/14/2017 07:57 AM, Jan Hubicka wrote:
> So it seems that the frequency of the loop itself is unrealistically
> scaled down.  Before vectorizing, the frequency is 8500 and the
> predicted number of iterations is 6.6.  Now the loop is entered via
> BB 8 with frequency 1148, so by exit probability the loop exits with
> 15% probability and thus still has 6.6 iterations, but by BB
> frequencies its body executes fewer times than the preheader.
>
> Now this is a fragile area: vectorizing the loop should scale the
> number of iterations down 8 times.  However, guessed CFG profiles are
> always very "flat".  Of course, if the loop iterated 6.6 times on
> average, vectorizing would not make any sense.  Making guessed
> profiles less flat is unrealistic, because the average loop iterates
> only a few times, but while vectorizing we make the additional guess
> that the vectorizable loop matters and that the guessed profile is
> probably unrealistic.

We have the same problem in the RTL loop unroller in that we'll scale the 
unrolled loop by the unroll factor 
(https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68212#c3), which can result in a 
loop with lower frequency than surrounding code. Problem is compounded if we 
vectorize the loop and then unroll it. Whatever approach is decided for the 
case when we have guessed profile should be applied to both vectorizer and RTL 
loop unroller.

-Pat



Re: [PATCH PR79347]Maintain profile counter information in vect_do_peeling

2017-02-14 Thread Bin.Cheng
On Tue, Feb 14, 2017 at 1:57 PM, Jan Hubicka  wrote:
>> Thanks,
>> bin
>> 2017-02-13  Bin Cheng  
>>
>>   PR tree-optimization/79347
>>   * tree-vect-loop-manip.c (apply_probability_for_bb): New function.
>>   (vect_do_peeling): Maintain profile counters during peeling.
>>
>> gcc/testsuite/ChangeLog
>> 2017-02-13  Bin Cheng  
>>
>>   PR tree-optimization/79347
>>   * gcc.dg/vect/pr79347.c: New test.
>
>> diff --git a/gcc/testsuite/gcc.dg/vect/pr79347.c 
>> b/gcc/testsuite/gcc.dg/vect/pr79347.c
>> new file mode 100644
>> index 000..586c638
>> --- /dev/null
>> +++ b/gcc/testsuite/gcc.dg/vect/pr79347.c
>> @@ -0,0 +1,13 @@
>> +/* { dg-do compile } */
>> +/* { dg-require-effective-target vect_int } */
>> +/* { dg-additional-options "-fdump-tree-vect-all" } */
>> +
>> +short *a;
>> +int c;
>> +void n(void)
>> +{
>> +  for (int i = 0; i < c; i++)
>> +    a[i]++;
>> +}
>
> Thanks for fixing the prologue.  I think there is still one extra problem in 
> the vectorizer.
> With the internal vectorized loop I now see:
>
> ;;   basic block 9, loop depth 1, count 0, freq 956, maybe hot
> ;;   Invalid sum of incoming frequencies 1961, should be 956
> ;;prev block 8, next block 10, flags: (NEW, REACHABLE, VISITED)
> ;;pred:   10 [100.0%]  (FALLTHRU,DFS_BACK,EXECUTABLE)
> ;;8 [100.0%]  (FALLTHRU)
>   # i_18 = PHI 
>   # vectp_a.13_66 = PHI 
>   # vectp_a.19_75 = PHI 
>   # ivtmp_78 = PHI 
>   _2 = (long unsigned int) i_18;
>   _3 = _2 * 2;
>   _4 = a.0_1 + _3;
>   vect__5.15_68 = MEM[(short int *)vectp_a.13_66];
>   _5 = *_4;
>   vect__6.16_69 = VIEW_CONVERT_EXPR(vect__5.15_68);
>   _6 = (unsigned short) _5;
>   vect__7.17_71 = vect__6.16_69 + vect_cst__70;
>   _7 = _6 + 1;
>   vect__8.18_72 = VIEW_CONVERT_EXPR(vect__7.17_71);
>   _8 = (short int) _7;
>   MEM[(short int *)vectp_a.19_75] = vect__8.18_72;
>   i_14 = i_18 + 1;
>   vectp_a.13_67 = vectp_a.13_66 + 16;
>   vectp_a.19_76 = vectp_a.19_75 + 16;
>   ivtmp_79 = ivtmp_78 + 1;
>   if (ivtmp_79 < bnd.10_59)
> goto ; [85.00%]
>   else
> goto ; [15.00%]
>
> So it seems that the frequency of the loop itself is unrealistically
> scaled down.  Before vectorizing, the frequency is 8500 and the
> predicted number of iterations is 6.6.  Now the loop is entered via
> BB 8 with frequency 1148, so by exit probability the loop exits with
> 15% probability and thus still has 6.6 iterations, but by BB
> frequencies its body executes fewer times than the preheader.
>
> Now this is a fragile area: vectorizing the loop should scale the
> number of iterations down 8 times.  However, guessed CFG profiles are
> always very "flat".  Of course, if the loop iterated 6.6 times on
> average, vectorizing would not make any sense.  Making guessed
> profiles less flat is unrealistic, because the average loop iterates
> only a few times, but while vectorizing we make the additional guess
> that the vectorizable loop matters and that the guessed profile is
> probably unrealistic.
That's what I mentioned in the original patch.  The vectorizer calls
scale_loop_profile in vect_transform_loop to scale down the loop's
frequency regardless of the mismatch between the loop and its
preheader/exit basic blocks.  In fact, after this patch all
mismatches in the vectorizer are introduced by this call.  I don't
see any way to keep the vectorized loop consistent with the rest of
the program without visiting the whole CFG.  So shall we skip scaling
down profile counters for the vectorized loop?

>
> GCC 6, however, seems a bit more consistent.
>> +/* Apply probability PROB to basic block BB and its single succ edge.  */
>> +
>> +static void
>> +apply_probability_for_bb (basic_block bb, int prob)
>> +{
>> +  bb->frequency = apply_probability (bb->frequency, prob);
>> +  bb->count = apply_probability (bb->count, prob);
>> +  gcc_assert (single_succ_p (bb));
>> +  single_succ_edge (bb)->count = bb->count;
>> +}
>> +
>>  /* Function vect_do_peeling.
>>
>> Input:
>> @@ -1690,7 +1701,18 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree 
>> niters, tree nitersm1,
>>   may be preferred.  */
>>basic_block anchor = loop_preheader_edge (loop)->src;
>>if (skip_vector)
>> -split_edge (loop_preheader_edge (loop));
>> +{
>> +  split_edge (loop_preheader_edge (loop));
>> +
>> +  /* Due to the order in which we peel prolog and epilog, we first
>> +  propagate probability to the whole loop.  The purpose is to
>> +  avoid adjusting probabilities of both prolog and vector loops
>> +  separately.  Note in this case, the probability of epilog loop
>> +  needs to be scaled back later.  */
>> +  basic_block bb_before_loop = loop_preheader_edge (loop)->src;
>> +  apply_probability_for_bb (bb_before_loop, prob_vector);
> Aha, this is the bit I missed while trying to fix it 

Re: [PATCH PR79347]Maintain profile counter information in vect_do_peeling

2017-02-14 Thread Jan Hubicka
> Thanks,
> bin
> 2017-02-13  Bin Cheng  
> 
>   PR tree-optimization/79347
>   * tree-vect-loop-manip.c (apply_probability_for_bb): New function.
>   (vect_do_peeling): Maintain profile counters during peeling.
> 
> gcc/testsuite/ChangeLog
> 2017-02-13  Bin Cheng  
> 
>   PR tree-optimization/79347
>   * gcc.dg/vect/pr79347.c: New test.

> diff --git a/gcc/testsuite/gcc.dg/vect/pr79347.c 
> b/gcc/testsuite/gcc.dg/vect/pr79347.c
> new file mode 100644
> index 000..586c638
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/vect/pr79347.c
> @@ -0,0 +1,13 @@
> +/* { dg-do compile } */
> +/* { dg-require-effective-target vect_int } */
> +/* { dg-additional-options "-fdump-tree-vect-all" } */
> +
> +short *a;
> +int c;
> +void n(void)
> +{
> +  for (int i = 0; i < c; i++)
> +    a[i]++;
> +}

Thanks for fixing the prologue.  I think there is still one extra problem in 
the vectorizer.
With the internal vectorized loop I now see:

;;   basic block 9, loop depth 1, count 0, freq 956, maybe hot
;;   Invalid sum of incoming frequencies 1961, should be 956
;;prev block 8, next block 10, flags: (NEW, REACHABLE, VISITED)
;;pred:   10 [100.0%]  (FALLTHRU,DFS_BACK,EXECUTABLE)
;;8 [100.0%]  (FALLTHRU)
  # i_18 = PHI 
  # vectp_a.13_66 = PHI 
  # vectp_a.19_75 = PHI 
  # ivtmp_78 = PHI 
  _2 = (long unsigned int) i_18;
  _3 = _2 * 2;
  _4 = a.0_1 + _3;
  vect__5.15_68 = MEM[(short int *)vectp_a.13_66];
  _5 = *_4;
  vect__6.16_69 = VIEW_CONVERT_EXPR(vect__5.15_68);
  _6 = (unsigned short) _5;
  vect__7.17_71 = vect__6.16_69 + vect_cst__70;
  _7 = _6 + 1;
  vect__8.18_72 = VIEW_CONVERT_EXPR(vect__7.17_71);
  _8 = (short int) _7;
  MEM[(short int *)vectp_a.19_75] = vect__8.18_72;
  i_14 = i_18 + 1;
  vectp_a.13_67 = vectp_a.13_66 + 16;
  vectp_a.19_76 = vectp_a.19_75 + 16;
  ivtmp_79 = ivtmp_78 + 1;
  if (ivtmp_79 < bnd.10_59)
goto ; [85.00%]
  else
goto ; [15.00%]

So it seems that the frequency of the loop itself is unrealistically
scaled down.  Before vectorizing, the frequency is 8500 and the
predicted number of iterations is 6.6.  Now the loop is entered via
BB 8 with frequency 1148, so by exit probability the loop exits with
15% probability and thus still has 6.6 iterations, but by BB
frequencies its body executes fewer times than the preheader.

Now this is a fragile area: vectorizing the loop should scale the
number of iterations down 8 times.  However, guessed CFG profiles are
always very "flat".  Of course, if the loop iterated 6.6 times on
average, vectorizing would not make any sense.  Making guessed
profiles less flat is unrealistic, because the average loop iterates
only a few times, but while vectorizing we make the additional guess
that the vectorizable loop matters and that the guessed profile is
probably unrealistic.

GCC 6, however, seems a bit more consistent.
> +/* Apply probability PROB to basic block BB and its single succ edge.  */
> +
> +static void
> +apply_probability_for_bb (basic_block bb, int prob)
> +{
> +  bb->frequency = apply_probability (bb->frequency, prob);
> +  bb->count = apply_probability (bb->count, prob);
> +  gcc_assert (single_succ_p (bb));
> +  single_succ_edge (bb)->count = bb->count;
> +}
> +
>  /* Function vect_do_peeling.
>  
> Input:
> @@ -1690,7 +1701,18 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree 
> niters, tree nitersm1,
>   may be preferred.  */
>basic_block anchor = loop_preheader_edge (loop)->src;
>if (skip_vector)
> -split_edge (loop_preheader_edge (loop));
> +{
> +  split_edge (loop_preheader_edge (loop));
> +
> +  /* Due to the order in which we peel prolog and epilog, we first
> +  propagate probability to the whole loop.  The purpose is to
> +  avoid adjusting probabilities of both prolog and vector loops
> +  separately.  Note in this case, the probability of epilog loop
> +  needs to be scaled back later.  */
> +  basic_block bb_before_loop = loop_preheader_edge (loop)->src;
> +  apply_probability_for_bb (bb_before_loop, prob_vector);
Aha, this is the bit I missed while trying to fix it myself.
I used scale_bbs_frequencies_int (&bb_before_loop, 1, prob_vector, REG_BR_PROB_BASE)
to do this.  I plan to revamp the API for this in the next stage1, but let's
keep this consistent.
Patch is OK with this change and ...
> +  scale_loop_profile (loop, prob_vector, bound);
... please try to check whether the scaling is really done reasonably.  From
the above it seems that the vectorized loop is unrealistically scaled down,
which may prevent further optimization for speed...

Thanks for looking into this,
Honza


[PATCH PR79347]Maintain profile counter information in vect_do_peeling

2017-02-14 Thread Bin Cheng
Hi,
This patch fixes the issue reported in PR79347 by calculating and maintaining
profile counter information on the fly in vect_do_peeling.  We first peel the
prologue loop, then peel the epilogue loop, and only then add the guarding
edge that skips the prolog+vector loop if niter is small.  Because of that
order, the patch uses a trick: it first scales down the counters for the loop
before peeling, and scales the counters back after adding the aforementioned
guarding edge.  Otherwise, more work would be needed to calculate counters
for the prolog and vector loops.  After this patch, the number of profile
counter mismatches for the tramp3d benchmark improves from:

tramp3d-v4.cpp.157t.ifcvt:296
tramp3d-v4.cpp.158t.vect:1118
tramp3d-v4.cpp.159t.dce6:1118
tramp3d-v4.cpp.160t.pcom:1118
tramp3d-v4.cpp.161t.cunroll:1019
tramp3d-v4.cpp.162t.slp1:1019
tramp3d-v4.cpp.164t.ivopts:1019
tramp3d-v4.cpp.165t.lim4:1019
tramp3d-v4.cpp.166t.loopdone:1007
tramp3d-v4.cpp.167t.no_loop:31
...
tramp3d-v4.cpp.226t.optimized:1009

to:

tramp3d-v4.cpp.157t.ifcvt:296
tramp3d-v4.cpp.158t.vect:814
tramp3d-v4.cpp.159t.dce6:814
tramp3d-v4.cpp.160t.pcom:814
tramp3d-v4.cpp.161t.cunroll:723
tramp3d-v4.cpp.162t.slp1:723
tramp3d-v4.cpp.164t.ivopts:723
tramp3d-v4.cpp.165t.lim4:723
tramp3d-v4.cpp.166t.loopdone:711
tramp3d-v4.cpp.167t.no_loop:31
...
tramp3d-v4.cpp.226t.optimized:831

Bootstrapped and tested on x86_64 and AArch64.  Is it OK?

BTW, with the patch, the vectorizer only introduces mismatches through the
code below in vect_transform_loop:

  /* Reduce loop iterations by the vectorization factor.  */
  scale_loop_profile (loop, GCOV_COMPUTE_SCALE (1, vf),
  expected_iterations / vf);

Though it makes sense to scale down according to the vectorization factor, it
definitely introduces a mismatch between the vector loop's frequency and the
rest of the program.  I also believe it is not that useful to scale here,
especially without profiling information.  At least we need to make the
vector loop's frequency consistent with the rest of the program.

Thanks,
bin
2017-02-13  Bin Cheng  

PR tree-optimization/79347
* tree-vect-loop-manip.c (apply_probability_for_bb): New function.
(vect_do_peeling): Maintain profile counters during peeling.

gcc/testsuite/ChangeLog
2017-02-13  Bin Cheng  

PR tree-optimization/79347
* gcc.dg/vect/pr79347.c: New test.

diff --git a/gcc/testsuite/gcc.dg/vect/pr79347.c b/gcc/testsuite/gcc.dg/vect/pr79347.c
new file mode 100644
index 000..586c638
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/pr79347.c
@@ -0,0 +1,13 @@
+/* { dg-do compile } */
+/* { dg-require-effective-target vect_int } */
+/* { dg-additional-options "-fdump-tree-vect-all" } */
+
+short *a;
+int c;
+void n(void)
+{
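+  for (int i = 0; i < c; i++)
+    a[i]++;
+}
[...]
+/* Apply probability PROB to basic block BB and its single succ edge.  */
+
+static void
+apply_probability_for_bb (basic_block bb, int prob)
+{
+  bb->frequency = apply_probability (bb->frequency, prob);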
+  bb->count = apply_probability (bb->count, prob);
+  gcc_assert (single_succ_p (bb));
+  single_succ_edge (bb)->count = bb->count;
+}
+
 /* Function vect_do_peeling.
 
Input:
@@ -1690,7 +1701,18 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree niters, 
tree nitersm1,
  may be preferred.  */
   basic_block anchor = loop_preheader_edge (loop)->src;
   if (skip_vector)
-split_edge (loop_preheader_edge (loop));
+{
+  split_edge (loop_preheader_edge (loop));
+
+  /* Due to the order in which we peel prolog and epilog, we first
+propagate probability to the whole loop.  The purpose is to
+avoid adjusting probabilities of both prolog and vector loops
+separately.  Note in this case, the probability of epilog loop
+needs to be scaled back later.  */
+  basic_block bb_before_loop = loop_preheader_edge (loop)->src;
+  apply_probability_for_bb (bb_before_loop, prob_vector);
+  scale_loop_profile (loop, prob_vector, bound);
+}
 
   tree niters_prolog = build_int_cst (type, 0);
   source_location loop_loc = find_loop_location (loop);
@@ -1727,6 +1749,7 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree niters, 
tree nitersm1,
  guard_cond = fold_build2 (EQ_EXPR, boolean_type_node,
niters_prolog, build_int_cst (type, 0));
  guard_bb = loop_preheader_edge (prolog)->src;
+ basic_block bb_after_prolog = loop_preheader_edge (loop)->src;
  guard_to = split_edge (loop_preheader_edge (loop));
  guard_e = slpeel_add_loop_guard (guard_bb, guard_cond,