On Tue, Feb 14, 2017 at 1:57 PM, Jan Hubicka <hubi...@ucw.cz> wrote: >> Thanks, >> bin >> 2017-02-13 Bin Cheng <bin.ch...@arm.com> >> >> PR tree-optimization/79347 >> * tree-vect-loop-manip.c (apply_probability_for_bb): New function. >> (vect_do_peeling): Maintain profile counters during peeling. >> >> gcc/testsuite/ChangeLog >> 2017-02-13 Bin Cheng <bin.ch...@arm.com> >> >> PR tree-optimization/79347 >> * gcc.dg/vect/pr79347.c: New test. > >> diff --git a/gcc/testsuite/gcc.dg/vect/pr79347.c >> b/gcc/testsuite/gcc.dg/vect/pr79347.c >> new file mode 100644 >> index 0000000..586c638 >> --- /dev/null >> +++ b/gcc/testsuite/gcc.dg/vect/pr79347.c >> @@ -0,0 +1,13 @@ >> +/* { dg-do compile } */ >> +/* { dg-require-effective-target vect_int } */ >> +/* { dg-additional-options "-fdump-tree-vect-all" } */ >> + >> +short *a; >> +int c; >> +void n(void) >> +{ >> + for (int i = 0; i<c;i++) >> + a[i]++; >> +} > > Thanks for fixing the prologue. I think there is still one extra problem in > the vectorizer. > With the internal vectorized loop I now see: > > ;; basic block 9, loop depth 1, count 0, freq 956, maybe hot > ;; Invalid sum of incoming frequencies 1961, should be 956 > ;; prev block 8, next block 10, flags: (NEW, REACHABLE, VISITED) > ;; pred: 10 [100.0%] (FALLTHRU,DFS_BACK,EXECUTABLE) > ;; 8 [100.0%] (FALLTHRU) > # i_18 = PHI <i_14(10), i_38(8)> > # vectp_a.13_66 = PHI <vectp_a.13_67(10), vectp_a.14_64(8)> > # vectp_a.19_75 = PHI <vectp_a.19_76(10), vectp_a.20_73(8)> > # ivtmp_78 = PHI <ivtmp_79(10), 0(8)> > _2 = (long unsigned int) i_18; > _3 = _2 * 2; > _4 = a.0_1 + _3; > vect__5.15_68 = MEM[(short int *)vectp_a.13_66]; > _5 = *_4; > vect__6.16_69 = VIEW_CONVERT_EXPR<vector(8) unsigned short>(vect__5.15_68); > _6 = (unsigned short) _5; > vect__7.17_71 = vect__6.16_69 + vect_cst__70; > _7 = _6 + 1; > vect__8.18_72 = VIEW_CONVERT_EXPR<vector(8) short int>(vect__7.17_71); > _8 = (short int) _7; > MEM[(short int *)vectp_a.19_75] = vect__8.18_72; > i_14 = i_18 + 1; > vectp_a.13_67 = vectp_a.13_66 + 16; > vectp_a.19_76 = vectp_a.19_75 + 16; > ivtmp_79 = ivtmp_78 + 1; > if (ivtmp_79 < bnd.10_59) > goto <bb 10>; [85.00%] > else > goto <bb 11>; [15.00%] > > So it seems that the frequency of the loop itself is unrealistically scaled > down. > Before vetorizing the frequency is 8500 and predicted number of iterations is > 6.6. Now the loop is intereed via BB 8 with frequency 1148, so the loop, by > exit probability exits with 15% probability and thus still has 6.6 iterations, > but by BB frequencies its body executes fewer times than the preheader. > > Now this is a fragile area vectirizing loop should scale number of iterations > down > 8 times. However guessed CFG profiles are always very "flat". Of course > if loop iterated 6.6 times at the average vectorizing would not make any > sense. > Making guessed profiles less flat is unrealistic, because average loop > iterates few times, > but of course while vectorizing we make additional guess that the > vectorizable loops > matters and the guessed profile is probably unrealistic. That's what I mentioned in the original patch. Vectorizer calls scale_loop_profile in function vect_transform_loop to scale down loop's frequency regardless mismatch between loop and preheader/exit basic blocks. In fact, after this patch all mismatches in vectorizer are introduced by this. I don't see any way to keep consistency beween vectorized loop and the rest program without visiting whole CFG. So shall we skip scaling down profile counters for vectorized loop?
> > GCC 6 seems however bit more consistent. >> +/* Apply probability PROB to basic block BB and its single succ edge. */ >> + >> +static void >> +apply_probability_for_bb (basic_block bb, int prob) >> +{ >> + bb->frequency = apply_probability (bb->frequency, prob); >> + bb->count = apply_probability (bb->count, prob); >> + gcc_assert (single_succ_p (bb)); >> + single_succ_edge (bb)->count = bb->count; >> +} >> + >> /* Function vect_do_peeling. >> >> Input: >> @@ -1690,7 +1701,18 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree >> niters, tree nitersm1, >> may be preferred. */ >> basic_block anchor = loop_preheader_edge (loop)->src; >> if (skip_vector) >> - split_edge (loop_preheader_edge (loop)); >> + { >> + split_edge (loop_preheader_edge (loop)); >> + >> + /* Due to the order in which we peel prolog and epilog, we first >> + propagate probability to the whole loop. The purpose is to >> + avoid adjusting probabilities of both prolog and vector loops >> + separately. Note in this case, the probability of epilog loop >> + needs to be scaled back later. */ >> + basic_block bb_before_loop = loop_preheader_edge (loop)->src; >> + apply_probability_for_bb (bb_before_loop, prob_vector); > Aha, this is the bit I missed while trying to fix it myself. > I scale_bbs_frequencies_int(&bb_before_loop, 1, prob_vector, REG_BR_PROB_BASE) > to do this. I plan to revamp API for this next stage1, but lets keep this > consistent. Will change that. > Path is OK with this change and ... >> + scale_loop_profile (loop, prob_vector, bound); Can't do this with current interface because I need to scale up frequency after skip_vector, and sub-calls in scale_loop_profile checks validity for prob and only allows scaling down... Thanks, bin > ... please try to check if scaling is really done reasonably. From the above > it seems that the vectorized loop is unrealistically scalled down that may > prevent > further optimization for speed... > > Thanks for looking into this, > Honza