On Mon, Sep 13, 2021 at 5:08 PM Niels Möller <[email protected]> wrote:

> [email protected] (Niels Möller) writes:
>
> > I've also added a cbc-aes128-encrypt.asm.
> > That gives more significant speedup, almost 60%. I think main reason for
> > the speedup is that we avoid reloading subkeys between blocks.
>
> I've continued this path, see branch aes-cbc. The aes128 variant is at
>
>
> https://git.lysator.liu.se/nettle/nettle/-/blob/aes-cbc/x86_64/aesni/cbc-aes128-encrypt.asm
>
> Benchmark results are positive but a bit puzzling. On my laptop (AMD
> Ryzen 5) I get
>
>             aes128  ECB encrypt 5450.18
>
> This is the latest version, doing two blocks per iteration.
>
>             aes128  CBC encrypt  547.34
>
> The general CBC mode written in C, with one call to aes128_encrypt per
> block. 10(!) times slower than ECB.
>
>         cbc_aes128      encrypt  865.11
>
> The new assembly function. Almost 60% speedup over the old code, which
> is nice, and large enough that it seems motivated to have the new
> function. But still 6 times slower than ECB. I'm not sure why. Let's look
> a bit closer at cycle numbers.
>
> Not sure I get accurate cycle numbers (it's a bit tricky with variable
> features and turbo modes and whatnot), but it looks like ECB mode is 6
> cycles per block, which would be consistent with issue of two aesenc
> instructions per cycle. While the CBC mode is 37 cycles per block,
> almost 4 cycles per aesenc.
>
> This could be explained if (i) latency of aesenc is 3-4 cycles, and (ii)
> the processor's out-of-order machinery results in as many as 7-8 blocks
> processed in parallel when executing the ECB loop, i.e., instruction
> issue for 3-4 iterations through the loop before the results of the
> first iteration are ready.
>
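
To make the arithmetic behind that hypothesis concrete, here is a rough
back-of-the-envelope model. It is only a sketch: the round count is the
standard 10 for AES-128, but the latency values (3-4 cycles) and the issue
rate (two aesenc per cycle) are assumptions taken from the estimates quoted
above, not measurements, and the function names are made up for illustration.

```python
# Rough pipeline model of the AES-128 numbers discussed above.
# All inputs are assumed values, not measured ones.
AES128_ROUNDS = 10  # 9 aesenc + 1 aesenclast; the initial key XOR is ignored


def cbc_cycles_per_block(latency):
    # CBC encryption is one long dependency chain: each round waits for the
    # previous round, and each block waits for the previous ciphertext.
    return AES128_ROUNDS * latency


def ecb_cycles_per_block(issue_per_cycle):
    # ECB blocks are independent, so with enough blocks in flight the cost
    # is bounded by how many aesenc instructions can start per cycle.
    return AES128_ROUNDS / issue_per_cycle


def blocks_in_flight(latency, issue_per_cycle):
    # Independent blocks needed to keep the AES unit busy
    # (latency times throughput).
    return latency * issue_per_cycle


for latency in (3, 4):
    print(
        f"latency={latency}: "
        f"CBC ~{cbc_cycles_per_block(latency)} cycles/block, "
        f"ECB ~{ecb_cycles_per_block(2)} cycles/block, "
        f"~{blocks_in_flight(latency, 2)} blocks in flight"
    )
```

Under these assumptions the model gives 30-40 cycles per block for CBC and
about 5 for ECB, with 6-8 blocks in flight, which is in the same range as
the 37 and 6 cycles per block estimated above.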

I ran the tests on an Intel Comet Lake processor and I can't think of
another explanation: the x86_64 out-of-order hardware appears to process
multiple blocks simultaneously, even without hand-written unrolling of the
block loop. Intel processors, or at least Comet Lake, also seem to do this
more effectively than your test processor (AMD Ryzen 5), so on Comet Lake
you don't even need the 2-way interleaved AES-ECB implementation or a
separate AES-CBC implementation. I got the same ECB and CBC benchmark
speeds in all cases, with CBC always 6 times slower than ECB.
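
The reason the hardware can overlap ECB blocks but not CBC blocks is the
data dependency between blocks. The toy sketch below illustrates only that
dependency. The block function is a truncated SHA-256 stand-in that I made
up for illustration; it is NOT AES and not nettle's aes128_encrypt, and the
helper names are hypothetical.

```python
import hashlib

BLOCK = 16  # AES block size in bytes


def toy_block_encrypt(key, block):
    # Hypothetical stand-in for a block cipher call: a keyed 16-byte mixing
    # function. Not AES, not invertible, only used to show data flow.
    return hashlib.sha256(key + block).digest()[:BLOCK]


def ecb_encrypt(key, blocks):
    # Every block is processed independently, in any order or in parallel.
    return [toy_block_encrypt(key, b) for b in blocks]


def cbc_encrypt(key, iv, blocks):
    # Each input is XORed with the previous ciphertext block, so block i
    # cannot start before block i-1 has finished.
    out = []
    prev = iv
    for b in blocks:
        mixed = bytes(x ^ y for x, y in zip(b, prev))
        prev = toy_block_encrypt(key, mixed)
        out.append(prev)
    return out


key = b"k" * BLOCK
iv = bytes(BLOCK)
msg = [bytes([i]) * BLOCK for i in range(4)]
changed = [b"\xff" * BLOCK] + msg[1:]  # modify only the first block

# Which ciphertext blocks change when only the first plaintext block changes?
ecb_diff = [a != b for a, b in zip(ecb_encrypt(key, msg),
                                   ecb_encrypt(key, changed))]
cbc_diff = [a != b for a, b in zip(cbc_encrypt(key, iv, msg),
                                   cbc_encrypt(key, iv, changed))]
print("ECB blocks affected:", ecb_diff)
print("CBC blocks affected:", cbc_diff)
```

In ECB only the first output block changes, while in CBC the change
propagates to every later block, so no amount of out-of-order execution can
start the next CBC block early.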

regards,
Mamone
_______________________________________________
nettle-bugs mailing list
[email protected]
http://lists.lysator.liu.se/mailman/listinfo/nettle-bugs
