On Wed, Nov 22, 2023 at 1:50 PM Niels Möller <[email protected]> wrote:
> David Edelsohn <[email protected]> writes:
>
> > Calls impose a lot of overhead on Power.
>
> Thanks, that's good to know.
>
> > And both the efficient loop instruction and the preferred indirect call
> > instruction use the CTR register.
>
> That's one thing I wonder after having a closer look at the AES loops.
>
> One rather common pattern in GMP and Nettle assembly loops, is to use
> the same register as both index register and loop counter. A loop that
> in C would conventionally be written as
>
>   for (i = 0; i < n; i++)
>     dst[i] = f(src[i]);
>
> is written in assembly closer to
>
>   dst += n; src += n; // Base registers point at end of arrays
>   n = -n;             // Use negative index register
>   for (; n != 0; n++)
>     dst[n] = f(src[n]);
>
> This saves one register (and eliminates corresponding update
> instructions), and the loop branch is based on carry flag (or zero flag)
> from the index register update n++. (If the items processed by the loop
> are larger than a byte, n would also be scaled by the size, and one
> would do n += size rather than n++, and it still works just fine).
>
> Would that pattern work well on power, or is it always preferable to use
> the special counter register, e.g., if it provides better branch
> prediction? I'm not so familiar with power assembly, but from the AES
> code it looks like the relevant instructions are mtctr to initialize the
> counter, and bdnz to decrement and branch.
>

Calls on Power have a high overhead in general, not because of jump or
return prediction, but because of the frame setup and teardown in the
midst of a highly speculating and out of order core. One thinks of the
processor executing the program instructions linearly, but in reality
lots of instructions are in flight with lots of register renaming and
lots of speculation.
The setup and teardown of the frames (saving and restoring registers in
the prologue and epilogue, including the link register) and confirmation
that the predictions were correct before committing the results can
cause unexpected load and store conflicts in flight.

MTCTR moves a GPR to the count (CTR) register. The CTR register is
optimized for zero-cost countable loops with bdnz (decrement the counter
and branch if it is nonzero) and related instructions. The CTR register
is also used for indirect calls (mtctr -> bctr, bcctr: branch to count
register, branch conditional to count register). For indirect branches,
one can also branch through the link register (mtlr -> blr), but that
can corrupt the link stack that the processor uses internally to predict
return addresses.

So one mainly has the CTR register for both loops and indirect calls.
However, if one uses the count register for an indirect call, for all
practical purposes it is not available as the count register for the
loop: spilling and restoring the count register introduces too many
stalls.

A call inside a loop is bad. An indirect call inside a loop is doubly
bad, both because of the call itself and because it prevents the loop
from using the optimal count register idiom.

Thanks, David
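
To make the last point concrete, here is a minimal C sketch contrasting an indirect call made once per element with a single indirect call to a routine that contains the loop. All names (elem_fn, block_fn, add_one, add_one_block, map_per_element, map_per_block) are hypothetical and are not taken from Nettle or GMP. Whether a given compiler actually emits mtctr/bdnz for the inner loop is an assumption that would need to be checked in the generated assembly.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-element operation, standing in for f() in the quoted loop. */
typedef uint8_t (*elem_fn)(uint8_t);

/* Hypothetical block operation: the loop lives inside the callee. */
typedef void (*block_fn)(uint8_t *dst, const uint8_t *src, size_t n);

uint8_t add_one(uint8_t x)
{
    return (uint8_t)(x + 1);
}

/* Variant A: one indirect call per element. On Power the call would
 * normally go through CTR (mtctr/bctrl), so CTR cannot also serve as
 * the loop counter, and every iteration pays the call overhead. */
void map_per_element(elem_fn f, uint8_t *dst, const uint8_t *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = f(src[i]);
}

/* Block routine for variant B: a plain countable loop with no calls
 * in its body, so CTR is free to be used as the loop counter. */
void add_one_block(uint8_t *dst, const uint8_t *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = (uint8_t)(src[i] + 1);
}

/* Variant B: a single indirect call for the whole array. */
void map_per_block(block_fn f, uint8_t *dst, const uint8_t *src, size_t n)
{
    f(dst, src, n);
}
```

Both variants compute the same result; the difference is only where the indirect call sits relative to the loop. Variant B pays the call overhead once per array rather than once per element.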
