On Thu, Apr 1, 2021 at 5:21 PM Niels Möller <[email protected]> wrote:

> [email protected] (Niels Möller) writes:
>
> > (iii) I've considered doing it earlier, to make it easier to implement
> >       aes without a round loop (like for all current versions of
> >       aes-encrypt-internal.*). E.g., on x86_64, for aes128 we could load
> >       all subkeys into registers and still have registers left to do two
> >       or more blocks in parallel, but then we'd need to override
> >       aes128_encrypt separately from the other aes*_encrypt.
>
> I've given this a try, see experimental patch below. It adds a
> x86_64/aesni/aes128-encrypt.asm, with a 2-way loop. It gives a very
> modest speedup, 5%, when I benchmark on my laptop (which is now a pretty
> fast machine, AMD Ryzen 5). I've also added a cbc-aes128-encrypt.asm.
> That gives more significant speedup, almost 60%. I think main reason for
> the speedup is that we avoid reloading subkeys between blocks.
>
> If we want to go this way, I wonder how to do it without an explosion of
> files and functions. For s390x, it seems each function will be very
> small, but not so for most other archs. There are at least three modes
> that are similar to cbc encrypt in that they have to process blocks
> sequentially, with no parallelism: CBC encrypt, CMAC, and XTS (there may
> be more). It's not so nice if we need (modes × ciphers) number of assembly
> files, with lots of duplication.
>

I can think of a core function for AES-CBC mode, cbc_aes_encrypt, that
the cbc_aes128_encrypt, cbc_aes192_encrypt, and cbc_aes256_encrypt
functions are built on. We could then optimize cbc_aes_encrypt in assembly,
taking care of the rounds parameter in the implementation. I still prefer
duplicating files and functions for AES modes with different round counts
rather than going with this approach, as I can't think of any other solution.
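
The core-function idea could be sketched in C roughly as follows. This is
only an illustration under assumptions: struct toy_ctx, toy_encrypt_block,
and all the signatures are hypothetical, and toy_encrypt_block is a
placeholder that is NOT AES. Only the shape matters: one shared
cbc_aes_encrypt that takes the round count (10, 12, or 14), and three thin
wrappers around it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLOCK_SIZE 16

/* Hypothetical context: room for the largest schedule (14 rounds + 1). */
struct toy_ctx {
  uint8_t subkeys[15][BLOCK_SIZE];
};

/* Placeholder standing in for the AES rounds. NOT AES, and not secure;
   it only mixes in rounds + 1 subkeys so the round count has an effect. */
static void
toy_encrypt_block(unsigned rounds, const uint8_t (*subkeys)[BLOCK_SIZE],
                  uint8_t *block)
{
  unsigned r, i;
  for (r = 0; r <= rounds; r++)
    for (i = 0; i < BLOCK_SIZE; i++)
      block[i] = (uint8_t) ((block[i] ^ subkeys[r][i]) + r);
}

/* Shared core: the one function that would be written in assembly.
   Blocks are processed strictly in sequence, chaining through iv. */
static void
cbc_aes_encrypt(unsigned rounds, const uint8_t (*subkeys)[BLOCK_SIZE],
                uint8_t *iv, size_t length, uint8_t *dst, const uint8_t *src)
{
  assert(length % BLOCK_SIZE == 0);
  for (; length > 0;
       length -= BLOCK_SIZE, src += BLOCK_SIZE, dst += BLOCK_SIZE)
    {
      unsigned i;
      for (i = 0; i < BLOCK_SIZE; i++)
        iv[i] ^= src[i];                 /* XOR plaintext into the chain */
      toy_encrypt_block(rounds, subkeys, iv);
      memcpy(dst, iv, BLOCK_SIZE);       /* ciphertext is the new chain value */
    }
}

/* Thin per-key-size wrappers: only the round count differs. */
void
cbc_aes128_encrypt(const struct toy_ctx *ctx, uint8_t *iv,
                   size_t length, uint8_t *dst, const uint8_t *src)
{
  cbc_aes_encrypt(10, ctx->subkeys, iv, length, dst, src);
}

void
cbc_aes192_encrypt(const struct toy_ctx *ctx, uint8_t *iv,
                   size_t length, uint8_t *dst, const uint8_t *src)
{
  cbc_aes_encrypt(12, ctx->subkeys, iv, length, dst, src);
}

void
cbc_aes256_encrypt(const struct toy_ctx *ctx, uint8_t *iv,
                   size_t length, uint8_t *dst, const uint8_t *src)
{
  cbc_aes_encrypt(14, ctx->subkeys, iv, length, dst, src);
}
```

The trade-off is that the core has to loop over (or branch on) the rounds
parameter, which is exactly what a fully unrolled per-key-size aes128
routine with all subkeys held in registers avoids.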
_______________________________________________
nettle-bugs mailing list
[email protected]
http://lists.lysator.liu.se/mailman/listinfo/nettle-bugs
