https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760

--- Comment #15 from rguenther at suse dot de <rguenther at suse dot de> ---
On Thu, 17 Jan 2019, wilco at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760
> 
> --- Comment #14 from Wilco <wilco at gcc dot gnu.org> ---
> (In reply to rguent...@suse.de from comment #13)
> 
> > Usually the peeling is done to improve branch prediction on the
> > prologue/epilogue.
> 
> Modern branch predictors do much better on a loop than with this kind of code:
> 
>         ands    x12, x11, 7
>         beq     .L70
>         cmp     x12, 1
>         beq     .L55
>         cmp     x12, 2
>         beq     .L57
>         cmp     x12, 3
>         beq     .L59
>         cmp     x12, 4
>         beq     .L61
>         cmp     x12, 5
>         beq     .L63
>         cmp     x12, 6
>         bne     .L72
> 
> That's way too many branches close together so most predictors will hit the
> maximum branches per fetch block limit and not predict them.
> 
> If you wanted to peel it would have to be like:
> 
> if (n & 4)
>   do 4 iterations
> if (n & 2)
>   do 2 iterations
> if (n & 1)
>   do 1 iteration
> 
> However that's still way too much code explosion for little or no gain. The
> only case where this makes sense is in a handwritten memcpy.

Classic peeling would be

  do 1 iteration
  if (n == 1) exit
  do 1 iteration
  if (n == 2) exit
...

which is what I refered to for branch prediction.  Your & prompts me
to a way to do sth similar as duffs device, turning the loop into a nest.

 head:
   if (n == 0) exit;
   <1 iteration>
   if (n & 1)
     n -= 1, goto head;
   <1 iteration>
   if (n & 2)
     n -= 2, goto head;
   <2 iterations>
   if (n & 4)
     n -= 4, goto head;
   <4 iterations>
   n -= 8, goto head;

the inner loop tests should become well-predicted quickly.

But as always - more branches make it more likely that one is
mispredicted.  For a single non-unrolled loop you usually only
get the actual exit mispredicted.

Reply via email to