On Tue, Feb 14, 2017 at 2:13 PM, Bin.Cheng <amker.ch...@gmail.com> wrote: > On Tue, Feb 14, 2017 at 1:57 PM, Jan Hubicka <hubi...@ucw.cz> wrote: >>> Thanks, >>> bin >>> 2017-02-13 Bin Cheng <bin.ch...@arm.com> >>> >>> PR tree-optimization/79347 >>> * tree-vect-loop-manip.c (apply_probability_for_bb): New function. >>> (vect_do_peeling): Maintain profile counters during peeling. >>> >>> gcc/testsuite/ChangeLog >>> 2017-02-13 Bin Cheng <bin.ch...@arm.com> >>> >>> PR tree-optimization/79347 >>> * gcc.dg/vect/pr79347.c: New test. >> >>> diff --git a/gcc/testsuite/gcc.dg/vect/pr79347.c >>> b/gcc/testsuite/gcc.dg/vect/pr79347.c >>> new file mode 100644 >>> index 0000000..586c638 >>> --- /dev/null >>> +++ b/gcc/testsuite/gcc.dg/vect/pr79347.c >>> @@ -0,0 +1,13 @@ >>> +/* { dg-do compile } */ >>> +/* { dg-require-effective-target vect_int } */ >>> +/* { dg-additional-options "-fdump-tree-vect-all" } */ >>> + >>> +short *a; >>> +int c; >>> +void n(void) >>> +{ >>> + for (int i = 0; i<c;i++) >>> + a[i]++; >>> +} >> >> Thanks for fixing the prologue. I think there is still one extra problem in >> the vectorizer. >> With the internal vectorized loop I now see: >> >> ;; basic block 9, loop depth 1, count 0, freq 956, maybe hot >> ;; Invalid sum of incoming frequencies 1961, should be 956 >> ;; prev block 8, next block 10, flags: (NEW, REACHABLE, VISITED) >> ;; pred: 10 [100.0%] (FALLTHRU,DFS_BACK,EXECUTABLE) >> ;; 8 [100.0%] (FALLTHRU) >> # i_18 = PHI <i_14(10), i_38(8)> >> # vectp_a.13_66 = PHI <vectp_a.13_67(10), vectp_a.14_64(8)> >> # vectp_a.19_75 = PHI <vectp_a.19_76(10), vectp_a.20_73(8)> >> # ivtmp_78 = PHI <ivtmp_79(10), 0(8)> >> _2 = (long unsigned int) i_18; >> _3 = _2 * 2; >> _4 = a.0_1 + _3; >> vect__5.15_68 = MEM[(short int *)vectp_a.13_66]; >> _5 = *_4; >> vect__6.16_69 = VIEW_CONVERT_EXPR<vector(8) unsigned short>(vect__5.15_68); >> _6 = (unsigned short) _5; >> vect__7.17_71 = vect__6.16_69 + vect_cst__70; >> _7 = _6 + 1; >> vect__8.18_72 = VIEW_CONVERT_EXPR<vector(8) short int>(vect__7.17_71); >> _8 = (short int) _7; >> MEM[(short int *)vectp_a.19_75] = vect__8.18_72; >> i_14 = i_18 + 1; >> vectp_a.13_67 = vectp_a.13_66 + 16; >> vectp_a.19_76 = vectp_a.19_75 + 16; >> ivtmp_79 = ivtmp_78 + 1; >> if (ivtmp_79 < bnd.10_59) >> goto <bb 10>; [85.00%] >> else >> goto <bb 11>; [15.00%] >> >> So it seems that the frequency of the loop itself is unrealistically scaled >> down. >> Before vetorizing the frequency is 8500 and predicted number of iterations is >> 6.6. Now the loop is intereed via BB 8 with frequency 1148, so the loop, by >> exit probability exits with 15% probability and thus still has 6.6 >> iterations, >> but by BB frequencies its body executes fewer times than the preheader. >> >> Now this is a fragile area vectirizing loop should scale number of >> iterations down >> 8 times. However guessed CFG profiles are always very "flat". Of course >> if loop iterated 6.6 times at the average vectorizing would not make any >> sense. >> Making guessed profiles less flat is unrealistic, because average loop >> iterates few times, >> but of course while vectorizing we make additional guess that the >> vectorizable loops >> matters and the guessed profile is probably unrealistic. > That's what I mentioned in the original patch. Vectorizer calls > scale_loop_profile in > function vect_transform_loop to scale down loop's frequency regardless > mismatch > between loop and preheader/exit basic blocks. In fact, after this > patch all mismatches > in vectorizer are introduced by this. I don't see any way to keep > consistency beween > vectorized loop and the rest program without visiting whole CFG. So > shall we skip > scaling down profile counters for vectorized loop? > >> >> GCC 6 seems however bit more consistent. >>> +/* Apply probability PROB to basic block BB and its single succ edge. */ >>> + >>> +static void >>> +apply_probability_for_bb (basic_block bb, int prob) >>> +{ >>> + bb->frequency = apply_probability (bb->frequency, prob); >>> + bb->count = apply_probability (bb->count, prob); >>> + gcc_assert (single_succ_p (bb)); >>> + single_succ_edge (bb)->count = bb->count; >>> +} >>> + >>> /* Function vect_do_peeling. >>> >>> Input: >>> @@ -1690,7 +1701,18 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree >>> niters, tree nitersm1, >>> may be preferred. */ >>> basic_block anchor = loop_preheader_edge (loop)->src; >>> if (skip_vector) >>> - split_edge (loop_preheader_edge (loop)); >>> + { >>> + split_edge (loop_preheader_edge (loop)); >>> + >>> + /* Due to the order in which we peel prolog and epilog, we first >>> + propagate probability to the whole loop. The purpose is to >>> + avoid adjusting probabilities of both prolog and vector loops >>> + separately. Note in this case, the probability of epilog loop >>> + needs to be scaled back later. */ >>> + basic_block bb_before_loop = loop_preheader_edge (loop)->src; >>> + apply_probability_for_bb (bb_before_loop, prob_vector); >> Aha, this is the bit I missed while trying to fix it myself. >> I scale_bbs_frequencies_int(&bb_before_loop, 1, prob_vector, >> REG_BR_PROB_BASE) >> to do this. I plan to revamp API for this next stage1, but lets keep this >> consistent. > Will change that. > >> Path is OK with this change and ... >>> + scale_loop_profile (loop, prob_vector, bound); > Can't do this with current interface because I need to scale up > frequency after skip_vector, > and sub-calls in scale_loop_profile checks validity for prob and only > allows scaling down...
Hi, Attachment is the updated patch. Bootstrap and tested. Is it OK? Thanks, bin 2017-02-13 Bin Cheng <bin.ch...@arm.com> PR tree-optimization/79347 * tree-vect-loop-manip.c (vect_do_peeling): Maintain profile counters during peeling. gcc/testsuite/ChangeLog 2017-02-13 Bin Cheng <bin.ch...@arm.com> PR tree-optimization/79347 * gcc.dg/vect/pr79347.c: New test. > > Thanks, > bin >> ... please try to check if scaling is really done reasonably. From the above >> it seems that the vectorized loop is unrealistically scalled down that may >> prevent >> further optimization for speed... >> >> Thanks for looking into this, >> Honza
diff --git a/gcc/testsuite/gcc.dg/vect/pr79347.c b/gcc/testsuite/gcc.dg/vect/pr79347.c new file mode 100644 index 0000000..586c638 --- /dev/null +++ b/gcc/testsuite/gcc.dg/vect/pr79347.c @@ -0,0 +1,13 @@ +/* { dg-do compile } */ +/* { dg-require-effective-target vect_int } */ +/* { dg-additional-options "-fdump-tree-vect-all" } */ + +short *a; +int c; +void n(void) +{ + for (int i = 0; i<c;i++) + a[i]++; +} + +/* { dg-final { scan-tree-dump-times "Invalid sum of " 2 "vect" } } */ diff --git a/gcc/tree-vect-loop-manip.c b/gcc/tree-vect-loop-manip.c index f29449c..5ee2c38 100644 --- a/gcc/tree-vect-loop-manip.c +++ b/gcc/tree-vect-loop-manip.c @@ -1690,7 +1690,19 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1, may be preferred. */ basic_block anchor = loop_preheader_edge (loop)->src; if (skip_vector) - split_edge (loop_preheader_edge (loop)); + { + split_edge (loop_preheader_edge (loop)); + + /* Due to the order in which we peel prolog and epilog, we first + propagate probability to the whole loop. The purpose is to + avoid adjusting probabilities of both prolog and vector loops + separately. Note in this case, the probability of epilog loop + needs to be scaled back later. */ + basic_block bb_before_loop = loop_preheader_edge (loop)->src; + scale_bbs_frequencies_int (&bb_before_loop, 1, prob_vector, + REG_BR_PROB_BASE); + scale_loop_profile (loop, prob_vector, bound); + } tree niters_prolog = build_int_cst (type, 0); source_location loop_loc = find_loop_location (loop); @@ -1727,6 +1739,7 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1, guard_cond = fold_build2 (EQ_EXPR, boolean_type_node, niters_prolog, build_int_cst (type, 0)); guard_bb = loop_preheader_edge (prolog)->src; + basic_block bb_after_prolog = loop_preheader_edge (loop)->src; guard_to = split_edge (loop_preheader_edge (loop)); guard_e = slpeel_add_loop_guard (guard_bb, guard_cond, guard_to, guard_bb, @@ -1734,6 +1747,9 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1, e = EDGE_PRED (guard_to, 0); e = (e != guard_e ? e : EDGE_PRED (guard_to, 1)); slpeel_update_phi_nodes_for_guard1 (prolog, loop, guard_e, e); + + scale_bbs_frequencies_int (&bb_after_prolog, 1, prob_prolog, + REG_BR_PROB_BASE); scale_loop_profile (prolog, prob_prolog, bound_prolog); } /* Update init address of DRs. */ @@ -1796,9 +1812,18 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1, e = EDGE_PRED (guard_to, 0); e = (e != guard_e ? e : EDGE_PRED (guard_to, 1)); slpeel_update_phi_nodes_for_guard1 (first_loop, epilog, guard_e, e); - scale_loop_profile (epilog, prob_vector, bound_scalar); + + /* Simply propagate profile info from guard_bb to guard_to which is + a merge point of control flow. */ + guard_to->frequency = guard_bb->frequency; + guard_to->count = guard_bb->count; + single_succ_edge (guard_to)->count = guard_to->count; + /* Scale probability of epilog loop back. */ + int scale_up = REG_BR_PROB_BASE * REG_BR_PROB_BASE / prob_vector; + scale_loop_frequencies (epilog, scale_up, REG_BR_PROB_BASE); } + basic_block bb_before_epilog = loop_preheader_edge (epilog)->src; tree niters_vector_mult_vf; /* If loop is peeled for non-zero constant times, now niters refers to orig_niters - prolog_peeling, it won't overflow even the orig_niters @@ -1826,6 +1851,16 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1, inverse_probability (prob_epilog)); slpeel_update_phi_nodes_for_guard2 (loop, epilog, guard_e, single_exit (epilog)); + /* Only need to handle basic block before epilog loop if it's not + the guard_bb, which is the case when skip_vector is true. */ + if (guard_bb != bb_before_epilog) + { + prob_epilog = (combine_probabilities (prob_vector, prob_epilog) + + inverse_probability (prob_vector)); + + scale_bbs_frequencies_int (&bb_before_epilog, 1, prob_epilog, + REG_BR_PROB_BASE); + } scale_loop_profile (epilog, prob_epilog, bound); } else