On Tue, Feb 14, 2017 at 2:13 PM, Bin.Cheng <amker.ch...@gmail.com> wrote:
> On Tue, Feb 14, 2017 at 1:57 PM, Jan Hubicka <hubi...@ucw.cz> wrote:
>>> Thanks,
>>> bin
>>> 2017-02-13  Bin Cheng  <bin.ch...@arm.com>
>>>
>>>       PR tree-optimization/79347
>>>       * tree-vect-loop-manip.c (apply_probability_for_bb): New function.
>>>       (vect_do_peeling): Maintain profile counters during peeling.
>>>
>>> gcc/testsuite/ChangeLog
>>> 2017-02-13  Bin Cheng  <bin.ch...@arm.com>
>>>
>>>       PR tree-optimization/79347
>>>       * gcc.dg/vect/pr79347.c: New test.
>>
>>> diff --git a/gcc/testsuite/gcc.dg/vect/pr79347.c 
>>> b/gcc/testsuite/gcc.dg/vect/pr79347.c
>>> new file mode 100644
>>> index 0000000..586c638
>>> --- /dev/null
>>> +++ b/gcc/testsuite/gcc.dg/vect/pr79347.c
>>> @@ -0,0 +1,13 @@
>>> +/* { dg-do compile } */
>>> +/* { dg-require-effective-target vect_int } */
>>> +/* { dg-additional-options "-fdump-tree-vect-all" } */
>>> +
>>> +short *a;
>>> +int c;
>>> +void n(void)
>>> +{
>>> +  for (int i = 0; i<c;i++)
>>> +    a[i]++;
>>> +}
>>
>> Thanks for fixing the prologue.  I think there is still one extra problem in 
>> the vectorizer.
>> With the internal vectorized loop I now see:
>>
>> ;;   basic block 9, loop depth 1, count 0, freq 956, maybe hot
>> ;;   Invalid sum of incoming frequencies 1961, should be 956
>> ;;    prev block 8, next block 10, flags: (NEW, REACHABLE, VISITED)
>> ;;    pred:       10 [100.0%]  (FALLTHRU,DFS_BACK,EXECUTABLE)
>> ;;                8 [100.0%]  (FALLTHRU)
>>   # i_18 = PHI <i_14(10), i_38(8)>
>>   # vectp_a.13_66 = PHI <vectp_a.13_67(10), vectp_a.14_64(8)>
>>   # vectp_a.19_75 = PHI <vectp_a.19_76(10), vectp_a.20_73(8)>
>>   # ivtmp_78 = PHI <ivtmp_79(10), 0(8)>
>>   _2 = (long unsigned int) i_18;
>>   _3 = _2 * 2;
>>   _4 = a.0_1 + _3;
>>   vect__5.15_68 = MEM[(short int *)vectp_a.13_66];
>>   _5 = *_4;
>>   vect__6.16_69 = VIEW_CONVERT_EXPR<vector(8) unsigned short>(vect__5.15_68);
>>   _6 = (unsigned short) _5;
>>   vect__7.17_71 = vect__6.16_69 + vect_cst__70;
>>   _7 = _6 + 1;
>>   vect__8.18_72 = VIEW_CONVERT_EXPR<vector(8) short int>(vect__7.17_71);
>>   _8 = (short int) _7;
>>   MEM[(short int *)vectp_a.19_75] = vect__8.18_72;
>>   i_14 = i_18 + 1;
>>   vectp_a.13_67 = vectp_a.13_66 + 16;
>>   vectp_a.19_76 = vectp_a.19_75 + 16;
>>   ivtmp_79 = ivtmp_78 + 1;
>>   if (ivtmp_79 < bnd.10_59)
>>     goto <bb 10>; [85.00%]
>>   else
>>     goto <bb 11>; [15.00%]
>>
>> So it seems that the frequency of the loop itself is unrealistically scaled 
>> down.
>> Before vetorizing the frequency is 8500 and predicted number of iterations is
>> 6.6.  Now the loop is intereed via BB 8 with frequency 1148, so the loop, by
>> exit probability exits with 15% probability and thus still has 6.6 
>> iterations,
>> but by BB frequencies its body executes fewer times than the preheader.
>>
>> Now this is a fragile area vectirizing loop should scale number of 
>> iterations down
>> 8 times. However guessed CFG profiles are always very "flat". Of course
>> if loop iterated 6.6 times at the average vectorizing would not make any 
>> sense.
>> Making guessed profiles less flat is unrealistic, because average loop 
>> iterates few times,
>> but of course while vectorizing we make additional guess that the 
>> vectorizable loops
>> matters and the guessed profile is probably unrealistic.
> That's what I mentioned in the original patch.  Vectorizer calls
> scale_loop_profile in
> function vect_transform_loop to scale down loop's frequency regardless 
> mismatch
> between loop and preheader/exit basic blocks.  In fact, after this
> patch all mismatches
> in vectorizer are introduced by this.  I don't see any way to keep
> consistency beween
> vectorized loop and the rest program without visiting whole CFG.  So
> shall we skip
> scaling down profile counters for vectorized loop?
>
>>
>> GCC 6 seems however bit more consistent.
>>> +/* Apply probability PROB to basic block BB and its single succ edge.  */
>>> +
>>> +static void
>>> +apply_probability_for_bb (basic_block bb, int prob)
>>> +{
>>> +  bb->frequency = apply_probability (bb->frequency, prob);
>>> +  bb->count = apply_probability (bb->count, prob);
>>> +  gcc_assert (single_succ_p (bb));
>>> +  single_succ_edge (bb)->count = bb->count;
>>> +}
>>> +
>>>  /* Function vect_do_peeling.
>>>
>>>     Input:
>>> @@ -1690,7 +1701,18 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree 
>>> niters, tree nitersm1,
>>>       may be preferred.  */
>>>    basic_block anchor = loop_preheader_edge (loop)->src;
>>>    if (skip_vector)
>>> -    split_edge (loop_preheader_edge (loop));
>>> +    {
>>> +      split_edge (loop_preheader_edge (loop));
>>> +
>>> +      /* Due to the order in which we peel prolog and epilog, we first
>>> +      propagate probability to the whole loop.  The purpose is to
>>> +      avoid adjusting probabilities of both prolog and vector loops
>>> +      separately.  Note in this case, the probability of epilog loop
>>> +      needs to be scaled back later.  */
>>> +      basic_block bb_before_loop = loop_preheader_edge (loop)->src;
>>> +      apply_probability_for_bb (bb_before_loop, prob_vector);
>> Aha, this is the bit I missed while trying to fix it myself.
>> I scale_bbs_frequencies_int(&bb_before_loop, 1, prob_vector, 
>> REG_BR_PROB_BASE)
>> to do this.  I plan to revamp API for this next stage1, but lets keep this 
>> consistent.
> Will change that.
>
>> Path is OK with this change and ...
>>> +      scale_loop_profile (loop, prob_vector, bound);
> Can't do this with current interface because I need to scale up
> frequency after skip_vector,
> and sub-calls in scale_loop_profile checks validity for prob and only
> allows scaling down...

Hi,
Attachment is the updated patch.  Bootstrap and tested.  Is it OK?

Thanks,
bin

2017-02-13  Bin Cheng  <bin.ch...@arm.com>

    PR tree-optimization/79347
    * tree-vect-loop-manip.c (vect_do_peeling): Maintain profile
    counters during peeling.

gcc/testsuite/ChangeLog
2017-02-13  Bin Cheng  <bin.ch...@arm.com>

    PR tree-optimization/79347
    * gcc.dg/vect/pr79347.c: New test.

>
> Thanks,
> bin
>> ... please try to check if scaling is really done reasonably.  From the above
>> it seems that the vectorized loop is unrealistically scalled down that may 
>> prevent
>> further optimization for speed...
>>
>> Thanks for looking into this,
>> Honza
diff --git a/gcc/testsuite/gcc.dg/vect/pr79347.c 
b/gcc/testsuite/gcc.dg/vect/pr79347.c
new file mode 100644
index 0000000..586c638
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/pr79347.c
@@ -0,0 +1,13 @@
+/* { dg-do compile } */
+/* { dg-require-effective-target vect_int } */
+/* { dg-additional-options "-fdump-tree-vect-all" } */
+
+short *a;
+int c;
+void n(void)
+{
+  for (int i = 0; i<c;i++)
+    a[i]++;
+}
+
+/* { dg-final { scan-tree-dump-times "Invalid sum of " 2 "vect" } } */
diff --git a/gcc/tree-vect-loop-manip.c b/gcc/tree-vect-loop-manip.c
index f29449c..5ee2c38 100644
--- a/gcc/tree-vect-loop-manip.c
+++ b/gcc/tree-vect-loop-manip.c
@@ -1690,7 +1690,19 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree niters, 
tree nitersm1,
      may be preferred.  */
   basic_block anchor = loop_preheader_edge (loop)->src;
   if (skip_vector)
-    split_edge (loop_preheader_edge (loop));
+    {
+      split_edge (loop_preheader_edge (loop));
+
+      /* Due to the order in which we peel prolog and epilog, we first
+        propagate probability to the whole loop.  The purpose is to
+        avoid adjusting probabilities of both prolog and vector loops
+        separately.  Note in this case, the probability of epilog loop
+        needs to be scaled back later.  */
+      basic_block bb_before_loop = loop_preheader_edge (loop)->src;
+      scale_bbs_frequencies_int (&bb_before_loop, 1, prob_vector,
+                                REG_BR_PROB_BASE);
+      scale_loop_profile (loop, prob_vector, bound);
+    }
 
   tree niters_prolog = build_int_cst (type, 0);
   source_location loop_loc = find_loop_location (loop);
@@ -1727,6 +1739,7 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree niters, 
tree nitersm1,
          guard_cond = fold_build2 (EQ_EXPR, boolean_type_node,
                                    niters_prolog, build_int_cst (type, 0));
          guard_bb = loop_preheader_edge (prolog)->src;
+         basic_block bb_after_prolog = loop_preheader_edge (loop)->src;
          guard_to = split_edge (loop_preheader_edge (loop));
          guard_e = slpeel_add_loop_guard (guard_bb, guard_cond,
                                           guard_to, guard_bb,
@@ -1734,6 +1747,9 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree niters, 
tree nitersm1,
          e = EDGE_PRED (guard_to, 0);
          e = (e != guard_e ? e : EDGE_PRED (guard_to, 1));
          slpeel_update_phi_nodes_for_guard1 (prolog, loop, guard_e, e);
+
+         scale_bbs_frequencies_int (&bb_after_prolog, 1, prob_prolog,
+                                    REG_BR_PROB_BASE);
          scale_loop_profile (prolog, prob_prolog, bound_prolog);
        }
       /* Update init address of DRs.  */
@@ -1796,9 +1812,18 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree niters, 
tree nitersm1,
          e = EDGE_PRED (guard_to, 0);
          e = (e != guard_e ? e : EDGE_PRED (guard_to, 1));
          slpeel_update_phi_nodes_for_guard1 (first_loop, epilog, guard_e, e);
-         scale_loop_profile (epilog, prob_vector, bound_scalar);
+
+         /* Simply propagate profile info from guard_bb to guard_to which is
+            a merge point of control flow.  */
+         guard_to->frequency = guard_bb->frequency;
+         guard_to->count = guard_bb->count;
+         single_succ_edge (guard_to)->count = guard_to->count;
+         /* Scale probability of epilog loop back.  */
+         int scale_up = REG_BR_PROB_BASE * REG_BR_PROB_BASE / prob_vector;
+         scale_loop_frequencies (epilog, scale_up, REG_BR_PROB_BASE);
        }
 
+      basic_block bb_before_epilog = loop_preheader_edge (epilog)->src;
       tree niters_vector_mult_vf;
       /* If loop is peeled for non-zero constant times, now niters refers to
         orig_niters - prolog_peeling, it won't overflow even the orig_niters
@@ -1826,6 +1851,16 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree niters, 
tree nitersm1,
                                           inverse_probability (prob_epilog));
          slpeel_update_phi_nodes_for_guard2 (loop, epilog, guard_e,
                                              single_exit (epilog));
+         /* Only need to handle basic block before epilog loop if it's not
+            the guard_bb, which is the case when skip_vector is true.  */
+         if (guard_bb != bb_before_epilog)
+           {
+             prob_epilog = (combine_probabilities (prob_vector, prob_epilog)
+                            + inverse_probability (prob_vector));
+
+             scale_bbs_frequencies_int (&bb_before_epilog, 1, prob_epilog,
+                                        REG_BR_PROB_BASE);
+           }
          scale_loop_profile (epilog, prob_epilog, bound);
        }
       else

Reply via email to