On Thu May 20, 2021 at 3:59 PM EDT, Maamoun TK wrote:
> On Thu, May 20, 2021 at 10:06 PM Niels Möller <[email protected]>
> wrote:
>
> > "Christopher M. Riedl" <[email protected]> writes:
> >
> > > So in total, if we assume an ideal (but impossible) zero-cost version
> > > for memxor, memxor3, and gcm_fill and avoid permutes via ISA 3.0 vector
> > > load/stores we can only account for 11.82 cycles/block; leaving 4.97
> > > cycles/block as an additional benefit of the combined implementation.
> >
> > One hypothesis for that gain is that we can avoid storing the aes input
> > in memory at all; instead, generate the counter values on the fly in
> > the appropriate registers.
> >
> > >> Another potential overhead is that data is stored to memory when passed
> > >> between these functions. It seems we store a block 3 times, and load a
> > >> block 4 times (the additional accesses should be cache friendly, but
> > >> will still cost some execution resources). Optimizing that seems to need
> > >> some kind of combined function. But maybe it is sufficient to optimize
> > >> something a bit more general than aes gcm, e.g., aes ctr?
> > >
> > > This would basically have to replace the nettle_crypt16 function call
> > > with arch-specific assembly, right? I can code this up and try it out in
> > > the context of AES-GCM.
> >
> > Yes, something like that. If we leave the _nettle_gcm_hash unchanged
> > (with its own independent assembly implementation), and look at
> > gcm_encrypt, what we have is
> >
> >       const void *cipher, nettle_cipher_func *f,
> >
> >   _nettle_ctr_crypt16(cipher, f, gcm_fill, ctx->ctr.b, length, dst, src);
> >
> > It would be nice if we could replace that with a call to aes_ctr_crypt,
> > and then optimizing that would benefit both gcm and plain ctr. But it's
> > not quite that easy, because gcm unfortunately uses its own variant of
> > ctr mode, which is why we need to pass the gcm_fill function in the
> > first place.
> >
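
For reference, a minimal C sketch of how the two counter updates differ, as
I understand it: plain CTR treats the whole 16-byte block as one big-endian
integer, while GCM only increments the last 32 bits and leaves the first 12
bytes alone (which is the part gcm_fill has to take care of). The function
names below are made up for illustration and are not nettle's.

```c
#include <stddef.h>
#include <stdint.h>

#define BLOCK_SIZE 16

/* Plain CTR (illustrative name): increment the whole block as a
   big-endian integer, carrying through all 16 bytes. */
static void
ctr_increment_full(uint8_t ctr[BLOCK_SIZE])
{
  for (size_t i = BLOCK_SIZE; i-- > 0; )
    if (++ctr[i] != 0)
      break;
}

/* GCM (illustrative name): increment only the last 4 bytes, modulo 2^32.
   A carry out of byte 12 is dropped, so bytes 0..11 never change. */
static void
ctr_increment_gcm(uint8_t ctr[BLOCK_SIZE])
{
  for (size_t i = BLOCK_SIZE; i-- > BLOCK_SIZE - 4; )
    if (++ctr[i] != 0)
      break;
}
```

The two only disagree when the low 32 bits wrap around, but that is enough
that a plain aes_ctr routine can't simply be reused for gcm.
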
> > So we need separate assembly for aes_plain_ctr and aes_gcm_ctr (they
> > *might* still share some code, but they would be distinct entry points).
> > Say we call the gcm-specific ctr function from some variant of
> > gcm_encrypt via a different function pointer. Then that gcm_encrypt
> > variant is getting a bit pointless. Maybe it's better to do
> >
> >   void aes128_gcm_encrypt(...)
> >   {
> >     _nettle_aes128_gcm_ctr(...);
> >     _nettle_gcm_hash(...);
> >   }
> >
> > At least, we avoid duplicating the _gcm_hash for aes128, aes192, aes256
> > (and any other algorithms we might want to optimize in a similar way).
> > And each of the aes assembly routines should be fairly small and easy to
> > maintain.
> >
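
To make sure I read the proposed layering correctly, here is a rough C
sketch of it. Everything in it is a hypothetical stand-in: the context
struct, the sketch_* names and the argument lists are my assumptions, the
"ctr" placeholders do no encryption (they copy src to dst and advance the
counter), and the "hash" placeholder is a simple XOR fold, not GHASH. The
only point is the shape: one small ctr entry point per key size, and a
single hash routine shared by all of them and run over the ciphertext.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical context; nettle's real gcm_ctx/gcm_key look different. */
struct sketch_gcm_ctx
{
  uint8_t ctr[16];  /* counter block */
  uint8_t x[16];    /* running hash state */
};

/* Advance the low 32 bits of the counter, once per 16-byte block. */
static void
sketch_advance_ctr(struct sketch_gcm_ctx *ctx, size_t length)
{
  for (size_t blocks = (length + 15) / 16; blocks > 0; blocks--)
    for (size_t i = 16; i-- > 12; )
      if (++ctx->ctr[i] != 0)
        break;
}

/* Placeholders for the per-key-size assembly entry points.
   No encryption happens here: src is copied to dst unchanged. */
static void
sketch_aes128_gcm_ctr(struct sketch_gcm_ctx *ctx, size_t length,
                      uint8_t *dst, const uint8_t *src)
{
  memmove(dst, src, length);
  sketch_advance_ctr(ctx, length);
}

static void
sketch_aes256_gcm_ctr(struct sketch_gcm_ctx *ctx, size_t length,
                      uint8_t *dst, const uint8_t *src)
{
  memmove(dst, src, length);
  sketch_advance_ctr(ctx, length);
}

/* Placeholder for the shared hash: XOR-folds data into x (not GHASH). */
static void
sketch_gcm_hash(struct sketch_gcm_ctx *ctx, size_t length,
                const uint8_t *data)
{
  for (size_t i = 0; i < length; i++)
    ctx->x[i % 16] ^= data[i];
}

/* Thin per-key-size wrappers; only the ctr call differs. */
void
sketch_aes128_gcm_encrypt(struct sketch_gcm_ctx *ctx, size_t length,
                          uint8_t *dst, const uint8_t *src)
{
  sketch_aes128_gcm_ctr(ctx, length, dst, src);
  sketch_gcm_hash(ctx, length, dst);  /* hash the output, not the input */
}

void
sketch_aes256_gcm_encrypt(struct sketch_gcm_ctx *ctx, size_t length,
                          uint8_t *dst, const uint8_t *src)
{
  sketch_aes256_gcm_ctr(ctx, length, dst, src);
  sketch_gcm_hash(ctx, length, dst);
}
```
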
>
> While writing the white paper "Optimize AES-GCM for PowerPC architecture
> processors", I concluded that this is the best approach for the PowerPC
> architecture: it is easy to maintain, avoids duplication, and performs
> well. I've separated aes_gcm encrypt/decrypt into two functions, aes_ctr
> and ghash, both implemented using Power ISA v3.00 with the help of
> vector-scalar registers.
> I got 1.18 cycles/byte for gcm-aes-128 encrypt/decrypt, 1.31 cycles/byte
> for gcm-aes-192 encrypt/decrypt, and 1.44 cycles/byte for gcm-aes-256
> encrypt/decrypt.

Neat, did you base that on the aes-gcm combined series I posted here, or
is it completely different/new code?

>
> Still, if there are additional vector registers, I would give the
> combined function a shot, as it eliminates loading the input message
> twice.
>
> regards,
> Mamone

_______________________________________________
nettle-bugs mailing list
[email protected]
http://lists.lysator.liu.se/mailman/listinfo/nettle-bugs
