On Tue, Feb 14, 2017 at 1:57 PM, Jan Hubicka <hubi...@ucw.cz> wrote:
>> Thanks,
>> bin
>> 2017-02-13  Bin Cheng  <bin.ch...@arm.com>
>>
>>       PR tree-optimization/79347
>>       * tree-vect-loop-manip.c (apply_probability_for_bb): New function.
>>       (vect_do_peeling): Maintain profile counters during peeling.
>>
>> gcc/testsuite/ChangeLog
>> 2017-02-13  Bin Cheng  <bin.ch...@arm.com>
>>
>>       PR tree-optimization/79347
>>       * gcc.dg/vect/pr79347.c: New test.
>
>> diff --git a/gcc/testsuite/gcc.dg/vect/pr79347.c 
>> b/gcc/testsuite/gcc.dg/vect/pr79347.c
>> new file mode 100644
>> index 0000000..586c638
>> --- /dev/null
>> +++ b/gcc/testsuite/gcc.dg/vect/pr79347.c
>> @@ -0,0 +1,13 @@
>> +/* { dg-do compile } */
>> +/* { dg-require-effective-target vect_int } */
>> +/* { dg-additional-options "-fdump-tree-vect-all" } */
>> +
>> +short *a;
>> +int c;
>> +void n(void)
>> +{
>> +  for (int i = 0; i<c;i++)
>> +    a[i]++;
>> +}
>
> Thanks for fixing the prologue.  I think there is still one extra problem in 
> the vectorizer.
> With the internal vectorized loop I now see:
>
> ;;   basic block 9, loop depth 1, count 0, freq 956, maybe hot
> ;;   Invalid sum of incoming frequencies 1961, should be 956
> ;;    prev block 8, next block 10, flags: (NEW, REACHABLE, VISITED)
> ;;    pred:       10 [100.0%]  (FALLTHRU,DFS_BACK,EXECUTABLE)
> ;;                8 [100.0%]  (FALLTHRU)
>   # i_18 = PHI <i_14(10), i_38(8)>
>   # vectp_a.13_66 = PHI <vectp_a.13_67(10), vectp_a.14_64(8)>
>   # vectp_a.19_75 = PHI <vectp_a.19_76(10), vectp_a.20_73(8)>
>   # ivtmp_78 = PHI <ivtmp_79(10), 0(8)>
>   _2 = (long unsigned int) i_18;
>   _3 = _2 * 2;
>   _4 = a.0_1 + _3;
>   vect__5.15_68 = MEM[(short int *)vectp_a.13_66];
>   _5 = *_4;
>   vect__6.16_69 = VIEW_CONVERT_EXPR<vector(8) unsigned short>(vect__5.15_68);
>   _6 = (unsigned short) _5;
>   vect__7.17_71 = vect__6.16_69 + vect_cst__70;
>   _7 = _6 + 1;
>   vect__8.18_72 = VIEW_CONVERT_EXPR<vector(8) short int>(vect__7.17_71);
>   _8 = (short int) _7;
>   MEM[(short int *)vectp_a.19_75] = vect__8.18_72;
>   i_14 = i_18 + 1;
>   vectp_a.13_67 = vectp_a.13_66 + 16;
>   vectp_a.19_76 = vectp_a.19_75 + 16;
>   ivtmp_79 = ivtmp_78 + 1;
>   if (ivtmp_79 < bnd.10_59)
>     goto <bb 10>; [85.00%]
>   else
>     goto <bb 11>; [15.00%]
>
> So it seems that the frequency of the loop itself is unrealistically scaled 
> down.
> Before vetorizing the frequency is 8500 and predicted number of iterations is
> 6.6.  Now the loop is intereed via BB 8 with frequency 1148, so the loop, by
> exit probability exits with 15% probability and thus still has 6.6 iterations,
> but by BB frequencies its body executes fewer times than the preheader.
>
> Now this is a fragile area vectirizing loop should scale number of iterations 
> down
> 8 times. However guessed CFG profiles are always very "flat". Of course
> if loop iterated 6.6 times at the average vectorizing would not make any 
> sense.
> Making guessed profiles less flat is unrealistic, because average loop 
> iterates few times,
> but of course while vectorizing we make additional guess that the 
> vectorizable loops
> matters and the guessed profile is probably unrealistic.
That's what I mentioned in the original patch.  Vectorizer calls
scale_loop_profile in
function vect_transform_loop to scale down loop's frequency regardless mismatch
between loop and preheader/exit basic blocks.  In fact, after this
patch all mismatches
in vectorizer are introduced by this.  I don't see any way to keep
consistency beween
vectorized loop and the rest program without visiting whole CFG.  So
shall we skip
scaling down profile counters for vectorized loop?

>
> GCC 6 seems however bit more consistent.
>> +/* Apply probability PROB to basic block BB and its single succ edge.  */
>> +
>> +static void
>> +apply_probability_for_bb (basic_block bb, int prob)
>> +{
>> +  bb->frequency = apply_probability (bb->frequency, prob);
>> +  bb->count = apply_probability (bb->count, prob);
>> +  gcc_assert (single_succ_p (bb));
>> +  single_succ_edge (bb)->count = bb->count;
>> +}
>> +
>>  /* Function vect_do_peeling.
>>
>>     Input:
>> @@ -1690,7 +1701,18 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree 
>> niters, tree nitersm1,
>>       may be preferred.  */
>>    basic_block anchor = loop_preheader_edge (loop)->src;
>>    if (skip_vector)
>> -    split_edge (loop_preheader_edge (loop));
>> +    {
>> +      split_edge (loop_preheader_edge (loop));
>> +
>> +      /* Due to the order in which we peel prolog and epilog, we first
>> +      propagate probability to the whole loop.  The purpose is to
>> +      avoid adjusting probabilities of both prolog and vector loops
>> +      separately.  Note in this case, the probability of epilog loop
>> +      needs to be scaled back later.  */
>> +      basic_block bb_before_loop = loop_preheader_edge (loop)->src;
>> +      apply_probability_for_bb (bb_before_loop, prob_vector);
> Aha, this is the bit I missed while trying to fix it myself.
> I scale_bbs_frequencies_int(&bb_before_loop, 1, prob_vector, REG_BR_PROB_BASE)
> to do this.  I plan to revamp API for this next stage1, but lets keep this 
> consistent.
Will change that.

> Path is OK with this change and ...
>> +      scale_loop_profile (loop, prob_vector, bound);
Can't do this with current interface because I need to scale up
frequency after skip_vector,
and sub-calls in scale_loop_profile checks validity for prob and only
allows scaling down...

Thanks,
bin
> ... please try to check if scaling is really done reasonably.  From the above
> it seems that the vectorized loop is unrealistically scalled down that may 
> prevent
> further optimization for speed...
>
> Thanks for looking into this,
> Honza

Reply via email to