Segher Boessenkool <seg...@kernel.crashing.org> writes:

Thanks all!
> Hi!
>
> On Mon, Jul 06, 2020 at 03:13:13PM +0800, guojiufu wrote:
>> For very small loops (< 6 insns), it would be fine to unroll 4
>> times to use cache line better.  Like below loops:
>>  `while (i) a[--i] = NULL;   while (p < e)  *d++ = *p++;`
>
> Yes, definitely.
>
>> And for very complex loops which may cause negative impacts:
>> branch-miss or cache-miss. Like below loop: there are calls,
>> early exits and branches in loop.
>> ```
>>   for (int i = 0; i < n; i++) {
>>               int e = a[I];
>>              ....
>>               if (function_call(e))  break;
>>              ....
>>   }
>> ```
>> 
>> This patch enhances RTL unroll for small loops and prevent to
>> unroll complex loops.
>
> I am not happy about what is considered "a complex loop" here.
For early exit, which may cause and *next* unrolled iterations may be
not executed, then unroll may be not benifit.
For too many branches (this patch say 20% of insns), may cause more
branch-misses even for unrolled loop.

>From my intial intuition, I once think each condition may *define* loop
as complex. :-)

But for each single condition, loop unrolling may still be helpful.
While, if these conditions are all occur in a loop, it would be more
possible to get negative impacts after unrolled.

We could continue discuss and tune it, as you pointed out below.

>
>> +/* Count the number of call insns in LOOP.  */
>> +static unsigned int
>> +num_loop_calls (struct loop *loop)
>> +{
>> +  basic_block *bbs;
>> +  rtx_insn *insn;
>> +  unsigned int i;
>> +  unsigned int call_ins_num = 0;
>> +
>> +  bbs = get_loop_body (loop);
>> +  for (i = 0; i < loop->num_nodes; i++)
>> +    FOR_BB_INSNS (bbs[i], insn)
>> +      if (CALL_P (insn))
>> +    call_ins_num++;
>> +
>> +  free (bbs);
>> +
>> +  return call_ins_num;
>> +}
>
> This function belongs in cfgloop.c really?  (Or cfgloopanal.c).  Next to
> num_loop_branches ;-)
Thanks for great sugguestion!
>
>> +/* Return true if LOOP is too complex to be unrolled.  */
>> +static bool
>> +rs6000_complex_loop_p (struct loop *loop)
>> +{
>> +  unsigned call_num;
>> +
>> +  return loop->ninsns > 10
>> +    && (call_num = num_loop_calls (loop)) > 0
>> +    && (call_num + num_loop_branches (loop)) * 5 > loop->ninsns
>> +    && !single_exit (loop);
>> +}
>
> Don't do initialisation in conditionals please (or in loop conditions),
> like Will said already.
Thanks!
>
> This calls only very specific loops "complex".  We need a better way
> to decide this :-(
>
>>  static unsigned
>>  rs6000_loop_unroll_adjust (unsigned nunroll, struct loop *loop)
>>  {
>> -   if (unroll_only_small_loops)
>> +  if (unroll_only_small_loops)
>>      {
>> -      /* TODO: This is hardcoded to 10 right now.  It can be refined, for
>> -     example we may want to unroll very small loops more times (4 perhaps).
>> -     We also should use a PARAM for this.  */
>> +      if (loop->ninsns <= 6)
>> +    return MIN (4, nunroll);
>>        if (loop->ninsns <= 10)
>>      return MIN (2, nunroll);
>> -      else
>> -    return 0;
>> +
>> +      return 0;
>>      }
>
> This part is fine.  It is independent of the rest AFAICS, so if you
> agree, this part is preapproved for trunk.  Thanks!
Sure, thanks.
>
> (A PARAM would be nice, but too many of those isn't actually useful
> either...  Next time, add one as soon as writing the code, at least it
> is useful at that point in time, when you still need to experiment with
> it :-) )
Yes, with PARAM, we can just change param to experiment:
"if (loop->ninsns <= 10) return 0;" (6insn - 10 insn) may just located
n one cache line, it may not need to unroll. But, after some tunning, I
chose 6/4,10/2 here. 4 hardcodes may be too many for PARAM.  2 PARAMs
(one for 6, one for 10) may be ok.  Any comments?
>
>> +  if (rs6000_complex_loop_p (loop))
>> +    return 0;
>> +
>>    return nunroll;
>>  }
>
> So, we use rs6000_complex_loop_p only to prevent all unrolling, never to
> reduce the unrolling, and only in very specific cases.
>
> Is there no middle road possible?  Say, don't unroll to more than 25
> insns total (which is what the "only small loops" does, sort of -- it
> also avoids unrolling 3x a bit, yes), and don't unroll to more than 2
> calls, and not to more than 4 branches (I'm making up those numbers, of
> course, and PARAMS would be helpful).  Some of this already does exist,
> and might need retuning for us?
It may make sense. There are param_max_unrolled_insns,
param_max_unroll_times, param_max_peel_branches(cunrol)...; we may add
calls number and branches number checking for rtl unroller.

While, actually, here we would need condition to define *complex* loop,
where contains call exist (may just 1), branch exist(may 2) and early
exit(may 1) at the same time, but each number is not large.
Any sugguestions? Thanks.

>
>
> Segher

Reply via email to