Segher Boessenkool <seg...@kernel.crashing.org> writes: Thanks all! > Hi! > > On Mon, Jul 06, 2020 at 03:13:13PM +0800, guojiufu wrote: >> For very small loops (< 6 insns), it would be fine to unroll 4 >> times to use cache line better. Like below loops: >> `while (i) a[--i] = NULL; while (p < e) *d++ = *p++;` > > Yes, definitely. > >> And for very complex loops which may cause negative impacts: >> branch-miss or cache-miss. Like below loop: there are calls, >> early exits and branches in loop. >> ``` >> for (int i = 0; i < n; i++) { >> int e = a[I]; >> .... >> if (function_call(e)) break; >> .... >> } >> ``` >> >> This patch enhances RTL unroll for small loops and prevent to >> unroll complex loops. > > I am not happy about what is considered "a complex loop" here. For early exit, which may cause and *next* unrolled iterations may be not executed, then unroll may be not benifit. For too many branches (this patch say 20% of insns), may cause more branch-misses even for unrolled loop.
>From my intial intuition, I once think each condition may *define* loop as complex. :-) But for each single condition, loop unrolling may still be helpful. While, if these conditions are all occur in a loop, it would be more possible to get negative impacts after unrolled. We could continue discuss and tune it, as you pointed out below. > >> +/* Count the number of call insns in LOOP. */ >> +static unsigned int >> +num_loop_calls (struct loop *loop) >> +{ >> + basic_block *bbs; >> + rtx_insn *insn; >> + unsigned int i; >> + unsigned int call_ins_num = 0; >> + >> + bbs = get_loop_body (loop); >> + for (i = 0; i < loop->num_nodes; i++) >> + FOR_BB_INSNS (bbs[i], insn) >> + if (CALL_P (insn)) >> + call_ins_num++; >> + >> + free (bbs); >> + >> + return call_ins_num; >> +} > > This function belongs in cfgloop.c really? (Or cfgloopanal.c). Next to > num_loop_branches ;-) Thanks for great sugguestion! > >> +/* Return true if LOOP is too complex to be unrolled. */ >> +static bool >> +rs6000_complex_loop_p (struct loop *loop) >> +{ >> + unsigned call_num; >> + >> + return loop->ninsns > 10 >> + && (call_num = num_loop_calls (loop)) > 0 >> + && (call_num + num_loop_branches (loop)) * 5 > loop->ninsns >> + && !single_exit (loop); >> +} > > Don't do initialisation in conditionals please (or in loop conditions), > like Will said already. Thanks! > > This calls only very specific loops "complex". We need a better way > to decide this :-( > >> static unsigned >> rs6000_loop_unroll_adjust (unsigned nunroll, struct loop *loop) >> { >> - if (unroll_only_small_loops) >> + if (unroll_only_small_loops) >> { >> - /* TODO: This is hardcoded to 10 right now. It can be refined, for >> - example we may want to unroll very small loops more times (4 perhaps). >> - We also should use a PARAM for this. */ >> + if (loop->ninsns <= 6) >> + return MIN (4, nunroll); >> if (loop->ninsns <= 10) >> return MIN (2, nunroll); >> - else >> - return 0; >> + >> + return 0; >> } > > This part is fine. It is independent of the rest AFAICS, so if you > agree, this part is preapproved for trunk. Thanks! Sure, thanks. > > (A PARAM would be nice, but too many of those isn't actually useful > either... Next time, add one as soon as writing the code, at least it > is useful at that point in time, when you still need to experiment with > it :-) ) Yes, with PARAM, we can just change param to experiment: "if (loop->ninsns <= 10) return 0;" (6insn - 10 insn) may just located n one cache line, it may not need to unroll. But, after some tunning, I chose 6/4,10/2 here. 4 hardcodes may be too many for PARAM. 2 PARAMs (one for 6, one for 10) may be ok. Any comments? > >> + if (rs6000_complex_loop_p (loop)) >> + return 0; >> + >> return nunroll; >> } > > So, we use rs6000_complex_loop_p only to prevent all unrolling, never to > reduce the unrolling, and only in very specific cases. > > Is there no middle road possible? Say, don't unroll to more than 25 > insns total (which is what the "only small loops" does, sort of -- it > also avoids unrolling 3x a bit, yes), and don't unroll to more than 2 > calls, and not to more than 4 branches (I'm making up those numbers, of > course, and PARAMS would be helpful). Some of this already does exist, > and might need retuning for us? It may make sense. There are param_max_unrolled_insns, param_max_unroll_times, param_max_peel_branches(cunrol)...; we may add calls number and branches number checking for rtl unroller. While, actually, here we would need condition to define *complex* loop, where contains call exist (may just 1), branch exist(may 2) and early exit(may 1) at the same time, but each number is not large. Any sugguestions? Thanks. > > > Segher