On Tue, Jun 1, 2021 at 11:21 PM Christopher M. Riedl <[email protected]> wrote:
> On Thu May 20, 2021 at 3:59 PM EDT, Maamoun TK wrote:
> > On Thu, May 20, 2021 at 10:06 PM Niels Möller <[email protected]> wrote:
> >
> > > "Christopher M. Riedl" <[email protected]> writes:
> > >
> > > > So in total, if we assume an ideal (but impossible) zero-cost version
> > > > for memxor, memxor3, and gcm_fill and avoid permutes via ISA 3.0 vector
> > > > load/stores, we can only account for 11.82 cycles/block, leaving 4.97
> > > > cycles/block as an additional benefit of the combined implementation.
> > >
> > > One hypothesis for that gain is that we can avoid storing the aes input
> > > in memory at all; instead, generate the counter values on the fly in
> > > the appropriate registers.
> > >
> > > >> Another potential overhead is that data is stored to memory when passed
> > > >> between these functions. It seems we store a block 3 times, and load a
> > > >> block 4 times (the additional accesses should be cache friendly, but
> > > >> will still cost some execution resources). Optimizing that seems to need
> > > >> some kind of combined function. But maybe it is sufficient to optimize
> > > >> something a bit more general than aes gcm, e.g., aes ctr?
> > > >
> > > > This would basically have to replace the _nettle_ctr_crypt16 function call
> > > > with arch-specific assembly, right? I can code this up and try it out in
> > > > the context of AES-GCM.
> > >
> > > Yes, something like that. If we leave the _nettle_gcm_hash unchanged
> > > (with its own independent assembly implementation), and look at
> > > gcm_encrypt, what we have is
> > >
> > >   const void *cipher, nettle_cipher_func *f,
> > >
> > >   _nettle_ctr_crypt16(cipher, f, gcm_fill, ctx->ctr.b, length, dst, src);
> > >
> > > It would be nice if we could replace that with a call to aes_ctr_crypt,
> > > and then optimizing that would benefit both gcm and plain ctr. But it's
> > > not quite that easy, because gcm unfortunately uses its own variant of
> > > ctr mode, which is why we need to pass the gcm_fill function in the
> > > first place.
> > >
> > > So it seems we need separate assembly for aes_plain_ctr and aes_gcm_ctr
> > > (they *might* still share some code, but they would be distinct entry
> > > points). Say we call the gcm-specific ctr function from some variant of
> > > gcm_encrypt via a different function pointer. Then that gcm_encrypt
> > > variant is getting a bit pointless. Maybe it's better to do
> > >
> > >   void aes128_gcm_encrypt(...)
> > >   {
> > >     _nettle_aes128_gcm_ctr(...);
> > >     _nettle_gcm_hash(...);
> > >   }
> > >
> > > At least, we avoid duplicating the _gcm_hash for aes128, aes192, aes256
> > > (and any other algorithms we might want to optimize in a similar way).
> > > And each of the aes assembly routines should be fairly small and easy to
> > > maintain.
> >
> > While writing the white paper "Optimize AES-GCM for PowerPC architecture
> > processors", I concluded that this is the best approach to implement for
> > the PowerPC architecture: it is easy to maintain, avoids duplication, and
> > performs well.
> > I've separated aes_gcm encrypt/decrypt into two functions, aes_ctr and
> > ghash. Both are implemented using Power ISA v3.00, assisted with
> > vector-scalar registers.
> > I got 1.18 cycles/byte for gcm-aes-128 encrypt/decrypt, 1.31 cycles/byte
> > for gcm-aes-192 encrypt/decrypt, and 1.44 cycles/byte for gcm-aes-256
> > encrypt/decrypt.
>
> Neat, did you base that on the aes-gcm combined series I posted here or
> completely different/new code?
It's based on new code written to fit the paper context.
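
To make the split concrete for anyone following the archive, here is a rough
C-level sketch of the structure being discussed. It is an illustration only:
_nettle_aes128_gcm_ctr is a placeholder name for the arch-specific entry
point, and the signatures and context fields are approximations based on
nettle's public gcm.h and the internal _nettle_gcm_hash declaration, not
committed API.

    #include <assert.h>
    #include <nettle/gcm.h>  /* struct gcm_aes128_ctx, GCM_BLOCK_SIZE */

    /* Placeholder: arch-specific keystream generation + xor for GCM's
       CTR variant (32-bit big-endian increment of the last counter
       word), replacing the generic
       _nettle_ctr_crypt16(cipher, f, gcm_fill, ...) path. The counter
       blocks can be generated on the fly in registers, so the AES
       input never touches memory. */
    void
    _nettle_aes128_gcm_ctr (const struct aes128_ctx *cipher, uint8_t *ctr,
                            size_t length, uint8_t *dst, const uint8_t *src);

    /* Prototype approximating the internal shared GHASH routine. */
    void
    _nettle_gcm_hash (const struct gcm_key *key, union nettle_block16 *x,
                      size_t length, const uint8_t *data);

    void
    aes128_gcm_encrypt (struct gcm_aes128_ctx *ctx,
                        size_t length, uint8_t *dst, const uint8_t *src)
    {
      assert (ctx->gcm.data_size % GCM_BLOCK_SIZE == 0);

      /* Encrypt: counter-mode keystream xored into dst. */
      _nettle_aes128_gcm_ctr (&ctx->cipher, ctx->gcm.ctr.b,
                              length, dst, src);

      /* Authenticate the ciphertext with the shared hash routine. */
      _nettle_gcm_hash (&ctx->key, &ctx->gcm.x, length, dst);

      ctx->gcm.data_size += length;
    }

The point of this shape is that an optimized aes192/aes256 variant only needs
its own ctr entry point; _nettle_gcm_hash stays a single shared
implementation, which is the maintenance win Niels describes above.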
