On Thu, Nov 30, 2017 at 3:13 PM, Bin.Cheng <amker.ch...@gmail.com> wrote: > On Thu, Nov 30, 2017 at 1:01 PM, Richard Biener > <richard.guent...@gmail.com> wrote: >> On Tue, Nov 28, 2017 at 4:26 PM, Bin Cheng <bin.ch...@arm.com> wrote: >>> Hi, >>> This is updated patch with review comments resolved. Some explanation >>> embedded below. >>> >>>> + >>>> + iloop->nb_iterations = nb_iterations; >>>> >>>> use std::swap? Also I think if you can keep nb_iterations you >>>> can certainly keep the upper bounds. You're probably >>>> afraid of the ->stmt references in the nb_iter_bound entries? >>>> >>>> Anyway, either scrap everything or try to keep everything. >>> Yeah, not only the stmts, but also the control_iv information because the >>> SCEV >>> information may be corrupted during code transformation. >>> Now I discarded all the information. >> >> Note that given you interchange the loops but not the CFG or the loop >> structures >> you might want to swap loop->num and flags like ->force_vectorize. That is, >> essentially change the ->header/latch association (and other CFG related >> stuff >> like recorded exits). >> >> It might also be we want to / need to disable interchange for, say, >> ->force_vectorize >> inner loops or ->unroll != 0? Or we need to clear them, maybe >> optionally diagnosing >> that fact. >> >> At least we need to think about what it means to preserve loop >> structure (semantically, >> loop->num should maintain association to the same source-level loop >> throughout the >> compilation) for transforms like interchange. >> >>>> >>>> + for (i = 0; oloop.reductions.iterate (i, &re); ++i) >>>> + { >>>> + if (re->type != DOUBLE_RTYPE) >>>> + gcc_unreachable (); >>>> + >>>> + use_operand_p use_p; >>>> + imm_use_iterator iterator; >>>> + FOR_EACH_IMM_USE_FAST (use_p, iterator, re->var) >>>> + mark_or_remove_dbg_stmt (USE_STMT (use_p), re->var); >>>> + FOR_EACH_IMM_USE_FAST (use_p, iterator, re->next) >>>> + mark_or_remove_dbg_stmt (USE_STMT (use_p), re->next); >>>> + if (TREE_CODE (re->init) == SSA_NAME) >>>> + { >>>> + FOR_EACH_IMM_USE_FAST (use_p, iterator, re->init) >>>> + mark_or_remove_dbg_stmt (USE_STMT (use_p), re->init); >>>> + } >>>> >>>> can you add a comment what you are doing here? >>>> >>>> Note that other loop opts simply scrap all debug stmts ... >>> As mentioned above, updated patch doesn't try hard to maintain debug use >>> info any more. >>> >>>> >>>> +static void >>>> +compute_access_stride (struct loop *loop_nest, struct loop *loop, >>>> + data_reference_p dr) >>>> +{ >>>> ... >>>> + tree ref = DR_REF (dr); >>>> + tree scev_base = build_fold_addr_expr (ref); >>>> + tree access_size = TYPE_SIZE_UNIT (TREE_TYPE (ref)); >>>> + tree niters = build_int_cst (sizetype, AVG_LOOP_NITER); >>>> + access_size = fold_build2 (MULT_EXPR, sizetype, niters, access_size); >>>> + >>>> + do { >>>> + tree scev_fn = analyze_scalar_evolution (loop, scev_base); >>>> + if (chrec_contains_undetermined (scev_fn) >>>> + || chrec_contains_symbols_defined_in_loop (scev_fn, loop->num)) >>>> + break; >>>> ... >>>> + strides->safe_push (scev_step); >>>> + } while (loop != loop_nest && (loop = loop_outer (loop)) != NULL); >>>> + >>>> >>>> I _think_ you want to do >>>> >>>> scev_fn = analyze_scalar_evolution (loop, scev_base); // assuming >>>> DR_STMT (dr) is in loop >>>> scev_fn = instantiate_parameters (nest, scev_fn); >>>> if (chrec_contains_undetermined (scev_fn)) >>>> return; // false? >>>> >>>> and analyze the result which should be of the form >>>> >>>> { { { init, +, step1 }_1, +, step2 }_2, + , step3 }_3 ... >>>> >>>> if canonical. I think estimate_val_by_simplify_replace isn't needed >>>> if you do that >>>> (it also looks odd to replace all vairables in step by niter...). >>> I replied on this in previous message, instantiate_parameters doesn't always >>> give canonical form result as expected. The loop here could be seen as a >>> local instantiate process, right? >> >> Kind of. I'll see if I can reproduce the difference with any of your >> intercahnge >> testcases - any hint which one to look at? > For added tests, I think there will be no difference between the two. > I noticed the difference for > pointer cases like: > for (i...) > for (j...) > for (k...) > p[i*n*n + j*n+ k] =... > >> >>> Also estimate_val_by_simplify_replace is needed for pointers, where strides >>> are computed from niters of loops which could be non compilation time >>> constant. >>> But yes, it's an odd fixup after I failed to do anything better. >> >> But you are for example computing _1 - _2 to zero, right? Because both _1 >> and _2 are not constant and thus you replace it with the same (symbolical) >> constant 'niter'. >> >> I think that asks for garbage-in-garbage-out ... >> >> Which testcase is this important for so I can have a look? > So far this is only for the above pointer case. Actually I don't > think it's that important, and thought about skip it. > So we don't have to do estimate_val_by_simplify_replace.
Ok, I'd like to "dumb" the pass down to the level we can solve the bwave case (which I realize is already one of the more complicated ones). Just because it's already late for GCC 8. Richard. > Thanks, > bin >> >>>> >>>> I think keeping the chrec in the above form is also more suitable for what >>>> the caller does so the outermost loop is simply >>>> >>>> loop = loop_nest; >>>> loop-over-all-dr-chrecs >>>> if (flow_loop_nested_p (loop, CHREC_LOOP (chrec))) >>>> loop = CHREC_LOOP (chrec); >>>> >>>> given the outermost loop is the outer evolution. If you sort the >>>> stride vecs from inner >>>> to outer you don't need prune_access_strides_not_in_loop. >>> Hmm, So stripping outer loops prefer inner to outer sort of strides, but >>> cost computation >>> during interchange prefers outer to inner sort because loop_nest in >>> tree-data-ref is sorted >>> in this way. Seems a single prune_* function is better than fiddling with >>> cost computation. >> >> Not sure how to interpret your answer... I'll see to have a more >> detailed suggestion >> after playing with the code a bit. >> >>>> >>>> +/* Count and return the number of loops in LOOP_NEST. */ >>>> + >>>> +unsigned int >>>> +num_loops_in_loop_nest (struct loop *loop_nest) >>>> +{ >>>> + unsigned num_loops; >>>> + for (num_loops = 0; loop_nest; num_loops++, loop_nest = >>>> loop_nest->inner) >>>> + ; >>>> + return num_loops; >>>> >>>> loop_depth (innermost) - loop_depth (nest)? >>> Done. >>> >>>> >>>> +static bool >>>> +should_interchange_loops (unsigned i_idx, unsigned o_idx, >>>> + vec<data_reference_p> datarefs, >>>> + bool innermost_loops_p, bool dump_info_p = true) >>>> +{ >>>> >>>> isn't all we need associating the above CHREC to sort after the >>>> CHREC_RIGHTs >>>> and figure a permutation sequence to arrive there? That is for the local >>>> decision you compute here it is CHREC_RIGHT [i_idx] > CHREC_RIGHT [o_idx] >>>> when we should interchange? >>>> >>>> That subloop_stride_p and tracking invariant DRs looks a bit odd. For >>>> loops >>>> where a DR is invariant you simply do not have an evolution in that loop. >>>> You seem to simply add strides in the inner and outer loops for each DR, >>>> can you explain how that works? Also I guess strides bigger than the >>>> various cache-line size parameters should be treated 'equal'? That is, >>>> if we don't get any DR to a step that results in L1 cache hits because >>>> two DRs share a cache line the interchange is pointless? >>> So given below loop: >>> >>> for (int i = 0; i < M; i++) >>> for (int j = 0; j < M; j++) >>> for (int k = 0; k < M; k++) >>> a[k][0][i] = b[k][0][i] >>> >>> We check if memory reference is invariant wrto a loop only if it has zero >>> strides within >>> current loop nest. In this example, there is no invariant given address >>> changes in the >>> innermost loop. >> >> But they simply wouldn't take part in the sorting? That is, invariant >> refs in a loop >> shouldn't prevent it becoming more inner or more outer, no? >> >>> For strides bigger than cache-line size, it's also possible the interchange >>> is wanted, as >>> in below example: >>> >>> for (int i = 0; i < M; i++) //loop 1 >>> for (int j = 0; j < M; j++) //loop 2 >>> for (int k = 0; k < M; k++) //loop 3 >>> a[j][i][k] = b[j][i][k] >>> >>> Strides for loop 1/2 are very like to be big, but after interchange, we >>> will have stream >>> access of both arrays. >>> >>> More advanced heuristics may be possible, but so far the estimation is >>> quite good by >>> checking all interchanges I looked into. >>> >>>> >>>> +/* Prune DATAREFS by removing any data reference not inside of LOOP. */ >>>> + >>>> +static inline void >>>> +prune_datarefs_not_in_loop (struct loop *loop, vec<data_reference_p> >>>> datarefs) >>>> +{ >>>> + struct data_reference *dr; >>>> + >>>> + for (unsigned i = 0; datarefs.iterate (i, &dr);) >>>> + if (flow_bb_inside_loop_p (loop, gimple_bb (DR_STMT (dr)))) >>>> + i++; >>>> + else >>>> + { >>>> + datarefs.ordered_remove (i); >>>> >>>> that's expensive. It's better to keep moving DRs we want to keep >>>> when walking the array. That is, add a j you increment only when >>>> we keep a DR, moving *i to *j. >>> Done. >>> >>>> >>>> + if (dr->aux) >>>> + { >>>> + DR_ACCESS_STRIDE (dr)->release (); >>>> + free (dr->aux); >>>> + } >>>> + free_data_ref (dr); >>>> + } >>>> +} >>>> >>>> + >>>> + start_loop = prune_non_rectangle_loop_nest (innermost_loop, start_loop); >>>> + >>>> >>>> Hmm. If you instantiate the SCEV for the niters for each loop in the nest >>>> wrt the nest you can figure if it has any evolution in sth else than the >>>> loop (evolution_function_is_univariate_p). That is, this is not a problem >>>> until you arrive at analyzing DR strides, right? At which point you >>>> can check for the appropriate form? >>> Hmm, not really. The niter relation may not appear in SCEV of reference >>> addr. >>> For example, below loop: >>> >>> for (int i = 0; i < M; i++) //loop 1 >>> for (int j = 0; j < M; j++) //loop 2 >>> for (int k = 0; k < i; k++) //loop 3 >>> a[k][0][i] = b[k][0][i] >>> >>> There is no information in data reference about i/j loops. >>> Anyway, I refactored the code and put this check in >>> proper_loop_form_for_interchange. >>> Simpler I think. >>> >>>> >>>> + if (find_data_references_in_loop (start_loop, datarefs) == >>>> chrec_dont_know >>>> + /* Check if there is no data reference. */ >>>> + || datarefs->length () == 0 >>>> + /* Check if there are too many of data references. */ >>>> + || ((int) datarefs->length () >>>> + > PARAM_VALUE (PARAM_LOOP_MAX_DATAREFS_FOR_DATADEPS)) >>>> + /* Check if there is any data reference in loop latch. We can't >>>> handle >>>> + loops which loop header and data references have different >>>> execution >>>> + times. */ >>>> + || dataref_niters_diff_to_loop_header (*datarefs) >>>> >>>> this suggests to do your own find_data_references_in_loop so you can early >>>> out. >>> I refactored the code a bit. Now this check is in >>> proper_loop_form_for_interchange, >>> but I do customized my own data references finder. It's needed to strip >>> outer loops >>> once a difficult reference is found. >>> >>>> >>>> Overall the flow through the pass is a bit hard to follow given there are >>>> IMHO too many functions. >>> Yeah, I removed quite number of small functions and refactor the code a >>> lot. Hope this >>> version is more straightforward. >>>> >>>> +unsigned int >>>> +pass_linterchange::execute (function *fun) >>>> +{ >>>> + if (number_of_loops (fun) <= 2) >>>> + return 0; >>>> + >>>> + bool changed_p = false;; >>>> + struct loop *loop; >>>> + vec<loop_p> loop_nest; >>>> + vec<data_reference_p> datarefs; >>>> + vec<ddr_p> ddrs; >>>> + >>>> + FOR_EACH_LOOP (loop, LI_ONLY_INNERMOST) >>>> + if (prepare_perfect_loop_nest (loop, &loop_nest, &datarefs, &ddrs)) >>>> + { >>>> + tree_loop_interchange loop_interchange (loop_nest, datarefs, ddrs); >>>> + changed_p |= loop_interchange.interchange (); >>>> + } >>>> >>>> you leak datarefs/ddrs? >>> It was release in destructor, but I refactored it anyway. I will push the >>> code to branch >>> gcc.gnu.org/svn/gcc/branches/gimple-linterchange. >>> >>> Thanks again for the comment of you two. >> >> Digging into the code now... >> >> Richard. >> >>> Thanks, >>> bin >>> 2017-11-27 Bin Cheng <bin.ch...@arm.com> >>> >>> * Makefile.in (gimple-loop-interchange.o): New object file. >>> * common.opt (floop-interchange): Reuse the option from graphite. >>> * doc/invoke.texi (-floop-interchange): Ditto. New document. >>> * gimple-loop-interchange.cc: New file. >>> * params.def (PARAM_LOOP_INTERCHANGE_MAX_NUM_STMTS): New parameter. >>> (PARAM_LOOP_INTERCHANGE_STRIDE_RATIO): New parameter. >>> * passes.def (pass_linterchange): New pass. >>> * timevar.def (TV_LINTERCHANGE): New time var. >>> * tree-pass.h (make_pass_linterchange): New declaration. >>> * tree-ssa-loop-ivcanon.c (create_canonical_iv): Change to external >>> interchange. Record IV before/after increment in new parameters. >>> * tree-ssa-loop-ivopts.h (create_canonical_iv): New declaration. >>> >>> gcc/testsuite >>> 2017-11-27 Bin Cheng <bin.ch...@arm.com> >>> >>> * gcc.dg/tree-ssa/loop-interchange-1.c: New test. >>> * gcc.dg/tree-ssa/loop-interchange-2.c: New test. >>> * gcc.dg/tree-ssa/loop-interchange-3.c: New test. >>> * gcc.dg/tree-ssa/loop-interchange-4.c: New test. >>> * gcc.dg/tree-ssa/loop-interchange-5.c: New test. >>> * gcc.dg/tree-ssa/loop-interchange-6.c: New test. >>> * gcc.dg/tree-ssa/loop-interchange-7.c: New test. >>> * gcc.dg/tree-ssa/loop-interchange-8.c: New test. >>> * gcc.dg/tree-ssa/loop-interchange-9.c: New test. >>> * gcc.dg/tree-ssa/loop-interchange-10.c: New test. >>> * gcc.dg/tree-ssa/loop-interchange-11.c: New test.