Danny Tsen <[email protected]> writes:
> Here is the new patch v4 for AES/GCM stitched implementation and
> benchmark based on the current repo.
Thanks. I'm not able to read it all carefully at the moment, but I have
a few comments, see below.
In the meantime, I've also tried to implement something similar for
x86_64, see branch x86_64-gcm-aes. Unfortunately, I get no speedup; on
the contrary, my stitched implementation seems considerably slower...
But at least that helped me understand the higher-level issues better.
> --- a/gcm-aes128.c
> +++ b/gcm-aes128.c
> @@ -63,14 +63,30 @@ void
> gcm_aes128_encrypt(struct gcm_aes128_ctx *ctx,
> size_t length, uint8_t *dst, const uint8_t *src)
> {
> - GCM_ENCRYPT(ctx, aes128_encrypt, length, dst, src);
> + size_t done = _gcm_aes_encrypt (&ctx->key, &ctx->gcm.x.b, &ctx->gcm.ctr.b,
> + _AES128_ROUNDS, &ctx->cipher.keys, length,
> dst, src);
I know I asked you to explode the context into many separate arguments
to _gcm_aes_encrypt. I'm now backpedalling a bit on that. For one, it's
not so nice to have so many arguments that they can't all be passed in
registers. Second, when running a fat build on a machine where the
needed instructions are unavailable, it's a bit of a waste to spend
lots of instructions on preparing those arguments just to call a no-op
function. So to reduce overhead, I'm now leaning towards an interface
like
  /* To reduce the number of arguments (e.g., maximum of 6 register
     arguments on x86_64), pass a pointer to gcm_key, which really is a
     pointer to the first member of the appropriate gcm_aes*_ctx
     struct. */
  size_t
  _gcm_aes_encrypt (struct gcm_key *CTX,
                    unsigned rounds,
                    size_t size, uint8_t *dst, const uint8_t *src);
That's not so pretty, but I think that is workable and efficient, and
since it is an internal function, the interface can be changed if this
is implemented on other architectures and we find out that it needs some
tweaks. See
https://git.lysator.liu.se/nettle/nettle/-/blob/x86_64-gcm-aes/x86_64/aesni_pclmul/gcm-aes-encrypt.asm?ref_type=heads
for the code I wrote to accept that ctx argument.
It would also be nice to have a #define around the code calling
_gcm_aes_encrypt, so that it is compiled only if (i) we have a
non-trivial implementation of _gcm_aes_encrypt, or (ii) we're a fat
build, which may select a non-trivial implementation of _gcm_aes_encrypt
at run time.
> + ctx->gcm.data_size += done;
> + length -= done;
> + if (length > 0) {
Not sure if the check for length > 0 is needed. It is fine to call
gcm_encrypt/GCM_ENCRYPT with length 0. There will be some overhead for
a call with length 0, though, and that may be a more common case when
_gcm_aes_encrypt is used?
> +define(`SAVE_GPR', `std $1, $2(SP)')
> +define(`RESTORE_GPR', `ld $1, $2(SP)')
I think the above two macros are unneeded; it's easier to read when
std and ld are used directly.
> +define(`SAVE_VR',
> + `li r11, $2
> + stvx $1, r11, $3')
> +define(`RESTORE_VR',
> + `li r11, $2
> + lvx $1, r11, $3')
It would be nice if we could trim the use of vector registers so we
don't need to save and restore lots of them. And if we need two
instructions anyway, maybe it would be clearer with PUSH_VR/POP_VR
macros that also adjust the stack pointer, and don't need an
additional register for indexing?
> + C load table elements
> + li r9,1*16
> + li r10,2*16
> + li r11,3*16
> + lxvd2x VSR(H1M),0,HT
> + lxvd2x VSR(H1L),r9,HT
> + lxvd2x VSR(H2M),r10,HT
> + lxvd2x VSR(H2L),r11,HT
> + li r9,4*16
> + li r10,5*16
> + li r11,6*16
> + li r12,7*16
> + lxvd2x VSR(H3M),r9,HT
> + lxvd2x VSR(H3L),r10,HT
> + lxvd2x VSR(H4M),r11,HT
> + lxvd2x VSR(H4L),r12,HT
I think it would be nicer to follow the style I tried to implement in
my recent updates, using some registers (e.g., r9-r11) as offsets,
initializing them only once, and using them everywhere. E.g., in this
case, the loading could be
lxvd2x VSR(H1M),0,HT
lxvd2x VSR(H1L),r9,HT
lxvd2x VSR(H2M),r10,HT
lxvd2x VSR(H2L),r11,HT
addi HT, HT, 64
lxvd2x VSR(H3M),0,HT
lxvd2x VSR(H3L),r9,HT
lxvd2x VSR(H4M),r10,HT
lxvd2x VSR(H4L),r11,HT
> + C do two 4x ghash
> +
> + C previous digest combining
> + xxlor vs0, VSR(S0), VSR(S0)
> + vxor S0,S0,D
> +
> + GF_MUL(F2, R2, H3L, H3M, S1)
> + GF_MUL(F, R, H4L, H4M, S0)
> + vxor F,F,F2
> + vxor R,R,R2
> +
> + xxlor VSR(S0), vs0, vs0
> +
> + GF_MUL(F3, R3, H2L, H2M, S2)
> + GF_MUL(F4, R4, H1L, H1M, S3)
> + vxor F3,F3,F4
> + vxor R3,R3,R4
> +
> + vxor F,F,F3
> + vxor D,R,R3
> + GHASH_REDUCE(D, F, POLY_L, R2, F2) C R2, F2 used as temporaries
It may be possible to reduce the number of registers, without making
the code slower, by accumulating differently. You ultimately
accumulate the values into F and D before the final reduction. Maybe
you don't need separate registers for all of F, F2, F3, F4, and all of
D, R, R2, R3, R4?
If you have any short-lived registers used for the AES part, they could
also overlap with short-lived registers used for ghash.
Regards,
/Niels
--
Niels Möller. PGP key CB4962D070D77D7FCB8BA36271D8F1FF368C6677.
Internet email is subject to wholesale government surveillance.
_______________________________________________
nettle-bugs mailing list -- [email protected]
To unsubscribe send an email to [email protected]