https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760
--- Comment #15 from rguenther at suse dot de <rguenther at suse dot de> --- On Thu, 17 Jan 2019, wilco at gcc dot gnu.org wrote: > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760 > > --- Comment #14 from Wilco <wilco at gcc dot gnu.org> --- > (In reply to rguent...@suse.de from comment #13) > > > Usually the peeling is done to improve branch prediction on the > > prologue/epilogue. > > Modern branch predictors do much better on a loop than with this kind of code: > > ands x12, x11, 7 > beq .L70 > cmp x12, 1 > beq .L55 > cmp x12, 2 > beq .L57 > cmp x12, 3 > beq .L59 > cmp x12, 4 > beq .L61 > cmp x12, 5 > beq .L63 > cmp x12, 6 > bne .L72 > > That's way too many branches close together so most predictors will hit the > maximum branches per fetch block limit and not predict them. > > If you wanted to peel it would have to be like: > > if (n & 4) > do 4 iterations > if (n & 2) > do 2 iterations > if (n & 1) > do 1 iteration > > However that's still way too much code explosion for little or no gain. The > only case where this makes sense is in a handwritten memcpy. Classic peeling would be do 1 iteration if (n == 1) exit do 1 iteration if (n == 2) exit ... which is what I refered to for branch prediction. Your & prompts me to a way to do sth similar as duffs device, turning the loop into a nest. head: if (n == 0) exit; <1 iteration> if (n & 1) n -= 1, goto head; <1 iteration> if (n & 2) n -= 2, goto head; <2 iterations> if (n & 4) n -= 4, goto head; <4 iterations> n -= 8, goto head; the inner loop tests should become well-predicted quickly. But as always - more branches make it more likely that one is mispredicted. For a single non-unrolled loop you usually only get the actual exit mispredicted.