On Wed, Aug 30, 2017 at 6:32 PM, Bin.Cheng <amker.ch...@gmail.com> wrote: > On Wed, Aug 30, 2017 at 3:19 PM, Richard Biener > <richard.guent...@gmail.com> wrote: >> On Wed, Aug 30, 2017 at 3:18 PM, Bin Cheng <bin.ch...@arm.com> wrote: >>> Hi, >>> This patch implements a simple loop interchange pass in GCC, as described >>> by its comments: >>> +/* This pass performs loop interchange: for example, the loop nest >>> + >>> + for (int j = 0; j < N; j++) >>> + for (int k = 0; k < N; k++) >>> + for (int i = 0; i < N; i++) >>> + c[i][j] = c[i][j] + a[i][k]*b[k][j]; >>> + >>> + is transformed to >>> + >>> + for (int i = 0; i < N; i++) >>> + for (int j = 0; j < N; j++) >>> + for (int k = 0; k < N; k++) >>> + c[i][j] = c[i][j] + a[i][k]*b[k][j]; >>> + >>> + This pass implements loop interchange in the following steps: >>> + >>> + 1) Find perfect loop nest for each innermost loop and compute data >>> + dependence relations for it. For above example, loop nest is >>> + <loop_j, loop_k, loop_i>. >>> + 2) From innermost to outermost loop, this pass tries to interchange >>> + each loop pair. For above case, it firstly tries to interchange >>> + <loop_k, loop_i> and loop nest becomes <loop_j, loop_i, loop_k>. >>> + Then it tries to interchange <loop_j, loop_i> and loop nest becomes >>> + <loop_i, loop_j, loop_k>. The overall effect is to move innermost >>> + loop to the outermost position. For loop pair <loop_i, loop_j> >>> + to be interchanged, we: >>> + 3) Check if data dependence relations are valid for loop interchange. >>> + 4) Check if both loops can be interchanged in terms of transformation. >>> + 5) Check if interchanging the two loops is profitable. >>> + 6) Interchange the two loops by mapping induction variables. >>> + >>> + This pass also handles reductions in loop nest. So far we only support >>> + simple reduction of inner loop and double reduction of the loop nest. >>> */ >>> >>> Actually, this pass only does loop shift which outermosting inner loop to >>> outer, rather >>> than permutation. Also as a traditional loop optimizer, it only works for >>> perfect loop >>> nest. I put it just after loop distribution thus ideally loop >>> split/distribution could >>> create perfect nest for it. Unfortunately, we don't get any perfect nest >>> from distribution >>> for now because it only works for innermost loop. For example, the >>> motivation case in >>> spec2k6/bwaves is not handled on this pass alone. I have a patch extending >>> distribution >>> for (innermost) loop nest and with that patch bwaves case can be handled. >>> Another point is I deliberately make both the cost model and code >>> transformation (very) >>> conservative. We can support more cases, or more transformations with >>> great care when >>> it is for sure known beneficial. IMHO, we already hit over-baked issues >>> quite often and >>> don't want to introduce more. >>> As for code generation, this patch has an issue that invariant code in >>> outer loop could >>> be moved to inner loop. For the moment, we rely on the last lim pass to >>> handle all INV >>> generated during interchange. In the future, we may need to avoid that in >>> interchange >>> itself, or another lim pass just like the one after graphite optimizations. >>> >>> Boostrap and test on x86_64 and AArch64. Various benchmarks built and run >>> successfully. >>> Note this pass is disabled in patch, while the code is exercised by >>> bootstrap/building >>> programs with it enabled by default. Any comments? >> > Thanks for quick review. >> +/* The same as above, but this one is only used for interchanging not >> + innermost loops. */ >> +#define OUTER_STRIDE_RATIO (2) >> >> please make all these knobs --params. >> >> +/* Enum type for loop reduction variable. */ >> +enum reduction_type >> +{ >> + UNKNOWN_RTYPE = 0, >> + SIMPLE_RTYPE, >> + DOUBLE_RTYPE >> +}; >> >> seeing this we should have some generic data structure / analysis for >> reduction detection. This adds a third user (autopar and vectorizer >> are the others). Just an idea. >> >> +/* Return true if E is abnormal edge. */ >> + >> +static inline bool >> +abnormal_edge (edge e) >> +{ >> + return (e->flags & (EDGE_EH | EDGE_ABNORMAL | EDGE_IRREDUCIBLE_LOOP)); >> +} >> >> bad name/comment for what it does. >> >> ... jumping to end of file / start of pass >> >> + /* Get the outer loop. */ >> + loop = superloop_at_depth (loop, loop_depth (loop) - 1); >> >> loop_outer (loop)? >> >> + /* Only support rectangle loop nest, i.e, inner loop's niters doesn't >> + depends on outer loop's IV. */ >> + if (chrec_contains_symbols_defined_in_loop (niters, loop->num)) >> + break; >> >> but you don't check for a three-nest whether niters depends on outer outer >> loop's IV that way. Either the check is superfluous here or incomplete. > It is checked for multi-nest case in can_interchange_loops. I will > move the check to this function so that we can save compilation time. >> >> + /* Check if start_loop forms a perfect loop nest. */ >> + loop_nest->create (3); >> + do { >> + datarefs->create (20); >> + ddrs->create (20); >> + loop_nest->truncate (0); >> + if (compute_data_dependences_for_loop (start_loop, false, loop_nest, >> + datarefs, ddrs)) >> + { >> + unsigned i; >> + struct data_dependence_relation *ddr; >> + >> + for (i = 0; ddrs->iterate (i, &ddr); ++i) >> + { >> + if (DDR_ARE_DEPENDENT (ddr) == chrec_known) >> + continue; >> + /* Give up on any unknown dependence. */ >> + if (DDR_ARE_DEPENDENT (ddr) == chrec_dont_know >> + || DDR_NUM_DIR_VECTS (ddr) == 0) >> + break; >> + } >> + >> + if (i == ddrs->length ()) >> + return true; >> >> better open-code this so we dont' waste time computing all dependences >> when we give up in the majority of cases (unknown dependence). > Right. Will make the change. >> Memleak here (ddrs and datarefs). >> >> + start_loop = start_loop->inner; >> + } while (start_loop && start_loop->inner); >> >> ick. So this is cubic -- nest depth * #drs * #drs ... (exactly why I >> never committed loop distribution for nests ;)). > Hmm, loop distribution for (innermost) nest is necessary for this pass > to handle bwaves unfortunately. It is also necessary to distribute > for (i;) > for (j) > arr[i * len + j] = 0; > into a single memcall, rather than a loop of sub-memcall.
I know... bwaves is what I implemented the change for to enable interchange. I think right now store-motion makes the nest difficult to handle as it has a reduction (but I see you handle reductions in interchange -- just this one cannot be handled I think?). Richard. >> >> I see that should_interchange_loops only uses datarefs. This means >> I'd rather do that as a very first step before considering validity >> (and computing dependences). That analysis (for all possible >> interchanges) should be much cheaper? I see it probably doesn't > Yes it makes the code less straightforward. I think we can > additionally call should_interchange_loops before computing > dependencies. This will bypass for most cases, for example, ~3500 > loops can be interchanged without cost model, while only ~200 loops > actually pass the cost model. > >> fit with the iteration you do very well... can't we somehow compute >> a loop permutation and apply it in a single step rather than >> piecewise with update_data_refs/deps? > Arbitrary loop permutation would be non-trivial. If you meant > computing loop shit (outermosting here) and transforming the code once > for all, I think it's doable. Cost model check can be easily extended > in that way, though transformation part needs quite lot rework. One > of my concern is, other than matrix-mul, I don't have motivation case > for multi-nest interchange so far. >> >> valid_data_dependences has almost no comments, it would be nice >> to add some (overall) one(s). > Sorry, I just realized it's wrong because direct vector is misused for > distance vector... Surprised it didn't bite with thousands of > interchanged loops in spec... Will fix it. > >> >> Thanks, >> Richard. >> >> >>> Thanks, >>> bin >>> 2017-08-29 Bin Cheng <bin.ch...@arm.com> >>> >>> * Makefile.in (tree-ssa-loop-interchange.o): New object file. >>> * common.opt (ftree-loop-interchange): New option. >>> * doc/invoke.texi (-ftree-loop-interchange): Document new option. >>> * passes.def (pass_linterchange): New pass. >>> * timevar.def (TV_LINTERCHANGE): New time var. >>> * tree-pass.h (make_pass_linterchange): New declaration. >>> * tree-ssa-loop-interchange.cc: New file. >>> * tree-ssa-loop-ivcanon.c (create_canonical_iv): Change to external. >>> * tree-ssa-loop-ivopts.h (create_canonical_iv): New declaration. >>> >>> gcc/testsuite >>> 2017-08-29 Bin Cheng <bin.ch...@arm.com> >>> >>> * gcc.dg/tree-ssa/loop-interchange-1.c: New test. >>> * gcc.dg/tree-ssa/loop-interchange-2.c: New test. >>> * gcc.dg/tree-ssa/loop-interchange-3.c: New test. >>> * gcc.dg/tree-ssa/loop-interchange-4.c: New test. >>> * gcc.dg/tree-ssa/loop-interchange-5.c: New test. >>> * gcc.dg/tree-ssa/loop-interchange-6.c: New test. >>> * gcc.dg/tree-ssa/loop-interchange-7.c: New test. >>> * gcc.dg/tree-ssa/loop-interchange-8.c: New test. >>> * gcc.dg/tree-ssa/loop-interchange-9.c: New test. >>> * gcc.dg/tree-ssa/loop-interchange-10.c: New test.