On Mon Apr 5, 2021 at 2:39 AM CDT, Niels Möller wrote:
> "Christopher M. Riedl" <[email protected]> writes:
>
> An implementation combining AES+GCM _can potentially_ yield significant
> performance boosts by allowing for increased instruction parallelism,
> avoiding C-function call overhead, more flexibility in assembly
> fine-tuning, etc. This series provides such an implementation based on
> the existing optimized Nettle routines for POWER9 and later processors.
> Benchmark results on a POWER9 Blackbird running at 3.5GHz are given at
> the end of this mail.
>
> Benchmark results are impressive. If I get the numbers right, cycles per
> block (16 bytes) is reduced from 40 to 22.5. You can run
> nettle-benchmark with the flag -f 3.5e9 (for 3.5GHz clock frequency) to
> get cycle numbers in the output.
Hi Niels,

Your math is very close - here are benchmark results for encrypt (since
decrypt is essentially the same):
AES+GCM combined (this series)
------------------------------
Algorithm mode Mbyte/s cycles/byte cycles/block
gcm_aes128 encrypt 2564.32 1.30 20.83
gcm_aes192 encrypt 2276.86 1.47 23.46
gcm_aes256 encrypt 2051.87 1.63 26.03
AES,GCM separate (nettle master)
--------------------------------
Algorithm mode Mbyte/s cycles/byte cycles/block
gcm_aes128 encrypt 1419.17 2.35 37.63
gcm_aes192 encrypt 1313.69 2.54 40.65
gcm_aes256 encrypt 1218.79 2.74 43.82
So for aes128: 37.63 - 20.83 = 16.80 cycles/block improvement.
>
> I'm a bit conservative about adding assembly code for combined
> operations, since it can lead to an explosion in the amount of code to
> maintain. So I'd like to understand a bit better where the 17.5 saved
> cycles were spent. For the code on master, gcm_encrypt (with aes) is
> built from these building blocks:
Makes perfect sense to me!
>
> * gcm_fill
>
> C code, essentially 2 64-bit stores per block. On little endian, it
> also needs some byte swapping.
>
> * aes_encrypt
>
> Using power assembly. Performance measured as the "aes128 ECB
> encrypt" line in nettle-benchmark output.
>
> * memxor3
>
> This is C code on power (and rather hairy C code). Performance can
> be measured with nettle-benchmark, and it's going to be a bit
> alignment dependent.
>
> * gcm_hash
>
> This uses power assembly. Performance is measured as the "gcm
> update" line in nettle-benchmark output. From your numbers, this
> seems to be 7.3 cycles per block.
>
> So before going all the way with a combined aes_gcm function, I think
> it's good to try to optimize the building blocks. Please benchmark
> memxor3, to see if it could benefit from assembly implementation. If so,
> that should give a nice speedup to several modes, not just gcm. (If you
> implement memxor3, beware that it needs to support some overlap, to not
> break in-place CBC decrypt).
The benchmark results don't convince me that memxor and memxor3 are a
huge bottleneck by themselves. They do suggest that my combined
implementation is dominated by the cost of AES (which matches what I see
when I run a simple encrypt test program under the 'perf' utility):
Algorithm mode Mbyte/s cycles/byte cycles/block
memxor aligned 16634.14 0.20 1.61
memxor unaligned 11089.33 0.30 2.41
memxor3 aligned 17261.19 0.19 1.55
memxor3 unaligned01 11549.04 0.29 2.31
memxor3 unaligned11 11181.62 0.30 2.39
memxor3 unaligned12 8485.88 0.39 3.15
aes128 ECB encrypt 2762.38 1.21 19.33
aes128 ECB decrypt 2203.65 1.51 24.24
I tried a few other experiments:
1. Replace memxor/3 with a no-op function (i.e. just 'return'):
Algorithm mode Mbyte/s cycles/byte cycles/block
gcm_aes128 encrypt 1553.08 2.15 34.39
gcm_aes192 encrypt 1428.57 2.34 37.38
gcm_aes256 encrypt 1318.05 2.53 40.52
aes128: 37.63 - 34.39 = 3.24 cycles/block
2. Replace memxor/3 and gcm_fill with no-op functions:
Algorithm mode Mbyte/s cycles/byte cycles/block
gcm_aes128 encrypt 1793.37 1.86 29.78
gcm_aes192 encrypt 1625.74 2.05 32.85
gcm_aes256 encrypt 1483.81 2.25 35.99
aes128: 34.39 - 29.78 = 4.61 cycles/block
3. Replace memxor/3 and gcm_fill with no-op functions, and use the
   POWER9 instructions lxvb16x/stxvb16x to load/store unaligned vectors,
   avoiding the permutes on LE:
Algorithm mode Mbyte/s cycles/byte cycles/block
gcm_aes128 encrypt 2069.67 1.61 25.80
gcm_aes192 encrypt 1875.97 1.78 28.47
gcm_aes256 encrypt 1717.33 1.94 31.10
aes128: 29.78 - 25.80 = 3.98 cycles/block
So in total, assuming ideal (but impossible) zero-cost versions of
memxor, memxor3, and gcm_fill and avoiding the permutes via the ISA 3.0
vector load/stores, we can only account for 11.83 of the 16.80
cycles/block; that leaves 4.97 cycles/block as the additional benefit of
the combined implementation. And since real implementations of these
building blocks cannot actually be zero-cost, the improvement from the
combined implementation in practice is at least ~5 cycles/block.
>
> Another potential overhead is that data is stored to memory when passed
> between these functions. It seems we store a block 3 times, and load a
> block 4 times (the additional accesses should be cache friendly, but
> will still cost some execution resources). Optimizing that seems to
> need some kind of combined function. But maybe it is sufficient to
> optimize something a bit more general than aes gcm, e.g., aes ctr?
This would basically have to replace the nettle_crypt16 function call
with arch-specific assembly, right? I can code this up and try it out in
the context of AES-GCM.
Thanks!
Chris R.
>
> Regards,
> /Niels
>
> --
> Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
> Internet email is subject to wholesale government surveillance.
_______________________________________________
nettle-bugs mailing list
[email protected]
http://lists.lysator.liu.se/mailman/listinfo/nettle-bugs