David Miller <da...@davemloft.net> writes:

  For values of N >= 1 we would expect 1 cycle per iteration.  But
  that's not exactly what happens.

     N    cycles
    ======================
     1       2
     2       3
     3       4
     4       5
     5       6
     6       8
     7      11
     8      14
     9      17
    10      20

  Things look fine until we get to N=6, where the extra loop iteration
  seems to take 2 cycles instead of 1, and from N=7 onward each
  additional iteration takes 3 cycles to execute.

  I've tried aligning the first instruction of the loop at different
  offsets, and this doesn't make any difference.
  Instruction fetch starvation?  Unlikely: the number of instructions
  required in the group (2) is half of the fetch size, and we're
  hitting the I-cache every time since my test programs time the loop
  multiple times.

  Perhaps there is something wonky with the branch predictor on these
  chips.  Information is incredibly sparse in this area, so it's hard
  to say what might or might not be happening.

I think they messed up "predicted taken" and "predicted non-taken" at
the gate level.  So after enough iterations, the predictor concludes,
correctly, that the branch will be taken.  And then they misinterpret
it.  The loop branch back is fast only when it is predicted non-taken.
:-)

-- 
Torbjörn
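The per-iteration cost implied by the quoted table can be read off by differencing consecutive rows. The following is only a minimal sketch in Python, not something from the original thread: the cycle counts are copied from the table above, while the names `measured` and `marginal_cycles` are made up for illustration.

```python
# Cycle counts copied from the table in the quoted message: N -> total cycles.
measured = {1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: 8, 7: 11, 8: 14, 9: 17, 10: 20}


def marginal_cycles(table):
    """Return {N: cycles(N) - cycles(N-1)} for every N whose predecessor is present.

    Hypothetical helper: it only differences the reported totals and
    assumes nothing about how they were measured.
    """
    return {n: table[n] - table[n - 1] for n in sorted(table) if n - 1 in table}


deltas = marginal_cycles(measured)
for n, d in deltas.items():
    print(f"N={n:2d}: +{d} cycle(s) for the extra iteration")
```

The differences are 1 cycle up to N=5, 2 cycles at N=6, and 3 cycles from N=7 onward, which matches the description in the quoted message.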