On Thu, Apr 1, 2021 at 5:21 PM Niels Möller <[email protected]> wrote:
> [email protected] (Niels Möller) writes:
>
> > (iii) I've considered doing it earlier, to make it easier to implement
> > aes without a round loop (like for all current versions of
> > aes-encrypt-internal.*). E.g., on x86_64, for aes128 we could load
> > all subkeys into registers and still have registers left to do two
> > or more blocks in parallel, but then we'd need to override
> > aes128_encrypt separately from the other aes*_encrypt.
>
> I've given this a try, see experimental patch below. It adds a
> x86_64/aesni/aes128-encrypt.asm, with a 2-way loop. It gives a very
> modest speedup, 5%, when I benchmark on my laptop (which is now a pretty
> fast machine, AMD Ryzen 5). I've also added a cbc-aes128-encrypt.asm.
> That gives more significant speedup, almost 60%. I think main reason for
> the speedup is that we avoid reloading subkeys between blocks.
>
> If we want to go this way, I wonder how to do it without an explosion of
> files and functions. For s390x, it seems each function will be very
> small, but not so for most other archs. There are at least three modes
> that are similar to cbc encrypt in that they have to process blocks
> sequentially, with no parallelism: CBC encrypt, CMAC, and XTS (there may
> be more). It's not so nice if we need (modes × ciphers) number of assembly
> files, with lots of duplication.
>

One option I can think of is a core function for AES-CBC mode,
cbc_aes_encrypt, on top of which the cbc_aes128_encrypt,
cbc_aes192_encrypt, and cbc_aes256_encrypt functions are built. We could
then optimize cbc_aes_encrypt in assembly, handling the rounds parameter
in the implementation.

Still, I prefer duplicating files and functions for the AES modes with
different numbers of rounds over this approach, as I can't think of any
other solution.

_______________________________________________
nettle-bugs mailing list
[email protected]
http://lists.lysator.liu.se/mailman/listinfo/nettle-bugs
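A minimal C sketch of the "core function plus per-key-size wrappers" layout described in the reply. It assumes the core routine takes the round count (10, 12, or 14 for AES-128/192/256) and a flat array of rounds + 1 subkeys. Every name here (cbc_aes_encrypt, cbc_aes128_encrypt, the sketch_aes*_ctx structs, toy_round) is hypothetical and is not Nettle's API. toy_round is an insecure, invertible stand-in for an AES round, used only so the example compiles on its own. Only the CBC chaining and the rounds dispatch are meant to be illustrative.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLOCK_SIZE 16

/* Hypothetical stand-in for one AES round. This is NOT AES: it XORs in the
   subkey, rotates each byte left by one bit, and shifts byte positions by one. */
static void
toy_round(uint8_t block[BLOCK_SIZE], const uint8_t *subkey)
{
  uint8_t tmp[BLOCK_SIZE];
  for (size_t i = 0; i < BLOCK_SIZE; i++)
    {
      uint8_t b = block[i] ^ subkey[i];
      tmp[(i + 1) % BLOCK_SIZE] = (uint8_t) ((b << 1) | (b >> 7));
    }
  memcpy(block, tmp, BLOCK_SIZE);
}

/* Hypothetical core routine: keys holds (rounds + 1) subkeys of BLOCK_SIZE
   bytes each, and length must be a multiple of BLOCK_SIZE. */
static void
cbc_aes_encrypt(unsigned rounds, const uint8_t *keys, uint8_t *iv,
                size_t length, uint8_t *dst, const uint8_t *src)
{
  assert(length % BLOCK_SIZE == 0);
  for (; length > 0; length -= BLOCK_SIZE, src += BLOCK_SIZE, dst += BLOCK_SIZE)
    {
      uint8_t block[BLOCK_SIZE];
      /* CBC chaining: XOR the plaintext with the previous ciphertext (or IV),
         plus the first subkey. This dependency is why blocks are sequential. */
      for (size_t i = 0; i < BLOCK_SIZE; i++)
        block[i] = src[i] ^ iv[i] ^ keys[i];
      /* The round count is a runtime parameter here. A per-key-size
         assembly file could instead unroll this loop and keep all subkeys
         in registers across blocks. */
      for (unsigned r = 1; r <= rounds; r++)
        toy_round(block, keys + r * BLOCK_SIZE);
      memcpy(dst, block, BLOCK_SIZE);
      memcpy(iv, block, BLOCK_SIZE);  /* next block chains on this one */
    }
}

/* Hypothetical contexts holding expanded subkeys for each key size. */
struct sketch_aes128_ctx { uint8_t keys[(10 + 1) * BLOCK_SIZE]; };
struct sketch_aes192_ctx { uint8_t keys[(12 + 1) * BLOCK_SIZE]; };
struct sketch_aes256_ctx { uint8_t keys[(14 + 1) * BLOCK_SIZE]; };

/* Thin per-key-size wrappers: the only thing they add is the round count. */
void
cbc_aes128_encrypt(const struct sketch_aes128_ctx *ctx, uint8_t *iv,
                   size_t length, uint8_t *dst, const uint8_t *src)
{
  cbc_aes_encrypt(10, ctx->keys, iv, length, dst, src);
}

void
cbc_aes192_encrypt(const struct sketch_aes192_ctx *ctx, uint8_t *iv,
                   size_t length, uint8_t *dst, const uint8_t *src)
{
  cbc_aes_encrypt(12, ctx->keys, iv, length, dst, src);
}

void
cbc_aes256_encrypt(const struct sketch_aes256_ctx *ctx, uint8_t *iv,
                   size_t length, uint8_t *dst, const uint8_t *src)
{
  cbc_aes_encrypt(14, ctx->keys, iv, length, dst, src);
}
```

The design choice this illustrates is the trade-off in the thread. One core routine avoids the (modes × ciphers) file explosion, but it has to handle the round count generically. Separate files per key size can fully unroll the rounds and dedicate registers to a known number of subkeys.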
