Re: ppc64: AES/GCM Performance improvement with stitched implementation

David Edelsohn Wed, 22 Nov 2023 11:22:10 -0800

On Wed, Nov 22, 2023 at 1:50 PM Niels Möller <[email protected]> wrote:

> David Edelsohn <[email protected]> writes:
>
> > Calls impose a lot of overhead on Power.
>
> Thanks, that's good to know.
>
> > And both the efficient loop instruction and the preferred indirect call
> > instruction use the CTR register.
>
> That's one thing I wonder after having a closer look at the AES loops.
>
> One rather common pattern in GMP and Nettle assembly loops, is to use
> the same register as both index register and loop counter. A loop that
> in C would conventionally be written as
>
>   for (i = 0; i < n; i++)
>     dst[i] = f(src[i]);
>
> is written in assembly closer to
>
>   dst += n; src += n; // Base registers point at end of arrays
>   n = -n; // Use negative index register
>   for (; n != 0; n++)
>     dst[n] = f(src[n]);
>
> This saves one register (and eliminates corresponding update
> instructions), and the loop branch is based on carry flag (or zero flag)
> from the index register update n++. (If the items processed by the loop
> are larger than a byte, n would also be scaled by the size, and one
> would do n += size rather than n++, and it still works just fine).
>
> Would that pattern work well on power, or is it always preferable to use
> the special counter register, e.g., if it provides better branch
> prediction? I'm not so familiar with power assembly, but from the AES
> code it looks like the relevant instructions are mtctr to initialize the
> counter, and bdnz to decrement and branch.
>

Calls on Power have a high overhead in general, not because of jump or
return prediction, but because of the frame setup and teardown in the midst
of a highly speculative, out-of-order core.  One thinks of the processor
executing the program instructions linearly, but in reality lots of
instructions are in flight with lots of register renaming and lots of
speculation.  The setup and teardown of the frames (saving and restoring
registers in the prologue and epilogue, including the link register) and
confirmation that the predictions were correct before committing the results
can cause unexpected load and store conflicts in flight.

mtctr moves a GPR to the count (CTR) register.  The CTR register is
optimized for zero-cost countable loops with bdnz (decrement CTR and branch
if the result is nonzero) and related instructions.
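
As a rough C sketch of the two loop shapes being compared (f here is only a
placeholder operation, and whether a compiler actually emits mtctr/bdnz for
the first form depends on the compiler and options, so treat this as an
illustration rather than a statement of what gets generated):

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-element operation, standing in for f() in the
   quoted example. */
uint8_t f(uint8_t x)
{
  return (uint8_t)(x ^ 0x5a);
}

/* Countable loop: the trip count n is known on entry, which is the
   shape that can map onto a CTR-based loop (mtctr once, then bdnz). */
void map_counted(uint8_t *dst, const uint8_t *src, size_t n)
{
  for (size_t i = 0; i < n; i++)
    dst[i] = f(src[i]);
}

/* Negative-index form from the quoted text: the base pointers are
   moved to the end of the arrays and a single variable serves as
   both index and loop counter, running from -n up to 0. */
void map_negative_index(uint8_t *dst, const uint8_t *src, size_t n)
{
  dst += n;
  src += n;
  for (ptrdiff_t i = -(ptrdiff_t)n; i != 0; i++)
    dst[i] = f(src[i]);
}
```

Both functions compute the same result; they differ only in which register
ends up controlling the loop branch.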

The CTR register also is used for indirect calls (mtctr followed by bctr or
bcctr: branch to counter, branch conditional to counter).  For indirect
branches, one also can branch indirect through the link register (mtlr ->
blr), but that can corrupt the link stack internal to the processor used to
predict return addresses.  So one mainly has the CTR register for both loops
and indirect calls.  However, if one uses the count register for an indirect
call, for all practical purposes, it is not available as the count register
for the loop -- spilling and restoring the count register introduces too
many stalls.

A call inside a loop is bad.  An indirect call inside a loop is doubly bad
because of the call itself and because it prevents the loop from utilizing
the optimal count register idiom.
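
To make the contrast concrete, here is a minimal C sketch with made-up names
(byte_fn, block_fn, xor_byte, xor_block and the two map_* wrappers are all
hypothetical, not Nettle interfaces).  The first wrapper makes an indirect
call per item inside the loop; the second makes one indirect call and leaves
the loop inside the callee, where it is free to use the count register:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical function-pointer types for illustration only. */
typedef uint8_t (*byte_fn)(uint8_t);
typedef void (*block_fn)(uint8_t *dst, const uint8_t *src, size_t n);

/* Placeholder per-byte operation. */
uint8_t xor_byte(uint8_t x)
{
  return (uint8_t)(x ^ 0x5a);
}

/* Same operation over a whole buffer; the loop contains no call to an
   unknown target, so nothing else competes for the count register. */
void xor_block(uint8_t *dst, const uint8_t *src, size_t n)
{
  for (size_t i = 0; i < n; i++)
    dst[i] = (uint8_t)(src[i] ^ 0x5a);
}

/* Indirect call inside the loop: every iteration pays the call
   overhead, and the call target needs the count register. */
void map_call_per_item(byte_fn fn, uint8_t *dst, const uint8_t *src, size_t n)
{
  for (size_t i = 0; i < n; i++)
    dst[i] = fn(src[i]);
}

/* One indirect call for the whole buffer; the loop lives in the callee. */
void map_call_once(block_fn fn, uint8_t *dst, const uint8_t *src, size_t n)
{
  fn(dst, src, n);
}
```

The results are identical; only the placement of the call relative to the
loop changes.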

Thanks, David