On Mon, Sep 13, 2021 at 5:08 PM Niels Möller <[email protected]> wrote:
> [email protected] (Niels Möller) writes:
>
> > I've also added a cbc-aes128-encrypt.asm.
> > That gives more significant speedup, almost 60%. I think main reason for
> > the speedup is that we avoid reloading subkeys between blocks.
>
> I've continued this path, see branch aes-cbc. The aes128 variant is at
>
> https://git.lysator.liu.se/nettle/nettle/-/blob/aes-cbc/x86_64/aesni/cbc-aes128-encrypt.asm
>
> Benchmark results are positive but a bit puzzling. On my laptop (AMD
> Ryzen 5) I get
>
>   aes128 ECB encrypt 5450.18
>
> This is the latest version, doing two blocks per iteration.
>
>   aes128 CBC encrypt 547.34
>
> The general CBC mode written in C, with one call to aes128_encrypt per
> block. 10(!) times slower than ECB.
>
>   cbc_aes128 encrypt 865.11
>
> The new assembly function. Almost 60% speedup over the old code, which
> is nice, and large enough that it seems motivated to have the new
> function. But still 6 times slower than ECB. I'm not sure why. Let's look
> a bit closer at cycle numbers.
>
> Not sure I get accurate cycle numbers (it's a bit tricky with variable
> features and turbo modes and whatnot), but it looks like ECB mode is 6
> cycles per block, which would be consistent with issue of two aesenc
> instructions per block. While the CBC mode is 37 cycles per block,
> almost 4 cycles per aesenc.
>
> This could be explained if (i) latency of aesenc is 3-4 cycles, and (ii)
> the processor's out-of-order machinery results in as many as 7-8 blocks
> processed in parallel when executing the ECB loop, i.e., instruction
> issue for 3-4 iterations through the loop before the results of the
> first iteration is ready.
>

I ran the tests on an Intel Comet Lake processor and I can't think of
another explanation. It seems that x86_64 processors issue instructions
for multiple blocks simultaneously, without hand-written unrolling of the
block loop.
Also, Intel processors, or at least the Intel Comet Lake architecture,
seem to implement this machinery more effectively than the processor you
tested on (AMD Ryzen 5), so you need neither 2-way interleaving in the
AES-ECB implementation nor a separate AES-CBC implementation. In all
cases I got the same benchmark speeds for ECB and for CBC, with CBC
always 6 times slower than ECB.

regards,
Mamone
_______________________________________________
nettle-bugs mailing list
[email protected]
http://lists.lysator.liu.se/mailman/listinfo/nettle-bugs
