On Wed, Mar 31, 2021 at 9:18 PM Niels Möller <[email protected]> wrote:

> The reason it makes sense to me to split
> aes-encrypt.c, is that:
>
>   (i) It's more consistent with the other aes-related functions.
>
>  (ii) The current aes-encrypt.c contains both the encryption functions
>       aes128_encrypt, aes192_encrypt, aes256_encrypt, which we'd want to
>       override with assembly implementations, and the legacy wrapper
>       function aes_encrypt, which shouldn't be overridden. So we can't
>       use plain file-level override, but need #ifdefs too.
>
> (iii) I've considered doing it earlier, to make it easier to implement
>       aes without a round loop (like for all current versions of
>       aes-encrypt-internal.*). E.g., on x86_64, for aes128 we could load
>       all subkeys into registers and still have registers left to do two
>       or more blocks in parallel, but then we'd need to override
>       aes128_encrypt separately from the other aes*_encrypt.
>
> I've tried out a split, see below patch. It's a rather large change,
> moving pieces to new places, but nothing difficult. I'm considering
> committing this to the s390x branch, what do you think?
>

I agree. I'll rework the patch with the basic AES-128 optimized functions so
that it builds on top of the split aes functions.
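
To make the layout concrete, here is a minimal sketch of what the split could look like. All names (toy_aes128_encrypt, struct toy_aes_ctx, toy_crypt) are hypothetical, and the "cipher" is just an XOR placeholder, not AES; the comments mark which hypothetical file each piece would live in, so that each per-key-size file can be overridden at file level while the legacy wrapper stays in C with no #ifdefs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct toy_aes128_ctx { uint8_t key[16]; };
struct toy_aes256_ctx { uint8_t key[32]; };

/* Placeholder "cipher": XOR with the key. This is NOT AES, it only
   gives the functions below something observable to do. */
static void
toy_crypt(const uint8_t *key, size_t key_size,
          size_t length, uint8_t *dst, const uint8_t *src)
{
  for (size_t i = 0; i < length; i++)
    dst[i] = src[i] ^ key[i % key_size];
}

/* --- would live in its own file, e.g. aes128-encrypt.c; an assembly
       file of the same base name could replace it wholesale --- */
void
toy_aes128_encrypt(const struct toy_aes128_ctx *ctx,
                   size_t length, uint8_t *dst, const uint8_t *src)
{
  toy_crypt(ctx->key, sizeof(ctx->key), length, dst, src);
}

/* --- likewise its own file, e.g. aes256-encrypt.c --- */
void
toy_aes256_encrypt(const struct toy_aes256_ctx *ctx,
                   size_t length, uint8_t *dst, const uint8_t *src)
{
  toy_crypt(ctx->key, sizeof(ctx->key), length, dst, src);
}

/* --- what remains in aes-encrypt.c: only the legacy wrapper, which
       is never overridden and just dispatches on the key size --- */
struct toy_aes_ctx
{
  unsigned key_size;
  union
  {
    struct toy_aes128_ctx ctx128;
    struct toy_aes256_ctx ctx256;
  } u;
};

void
toy_aes_encrypt(const struct toy_aes_ctx *ctx,
                size_t length, uint8_t *dst, const uint8_t *src)
{
  switch (ctx->key_size)
    {
    case 16:
      toy_aes128_encrypt(&ctx->u.ctx128, length, dst, src);
      break;
    case 32:
      toy_aes256_encrypt(&ctx->u.ctx256, length, dst, src);
      break;
    default:
      assert(!"unsupported key size");
    }
}
```

With this shape, overriding toy_aes128_encrypt separately from the other key sizes needs no changes to the wrapper.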

> Regarding the large number of functions for s390x, I'm not yet convinced
> we should have all of them, we'll have to consider the tradeoff between
> speedup and complexity case by case. In particular, cbc encrypt (but not
> decrypt!) is notoriously slow, since it's inherently serial. So I'm
> curious about potential speedup there.
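
For readers less familiar with why CBC encrypt is serial while decrypt is not, here is a small sketch. It uses a made-up invertible toy block transform (XOR with the key plus a bit rotation), not AES, and all function names are hypothetical. The point is only the data dependency: encryption of block i needs ciphertext block i-1, while decryption of block i needs only ciphertext blocks i and i-1, which are all available up front. To stress that, the decrypt loop below deliberately walks the blocks in reverse order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_BLOCK_SIZE 16

static uint8_t rotl8(uint8_t x) { return (uint8_t) ((x << 1) | (x >> 7)); }
static uint8_t rotr8(uint8_t x) { return (uint8_t) ((x >> 1) | (x << 7)); }

/* Toy invertible block transform standing in for AES. */
static void
toy_encrypt_block(const uint8_t *key, uint8_t *dst, const uint8_t *src)
{
  for (size_t i = 0; i < TOY_BLOCK_SIZE; i++)
    dst[i] = rotl8((uint8_t) (src[i] ^ key[i]));
}

static void
toy_decrypt_block(const uint8_t *key, uint8_t *dst, const uint8_t *src)
{
  for (size_t i = 0; i < TOY_BLOCK_SIZE; i++)
    dst[i] = (uint8_t) (rotr8(src[i]) ^ key[i]);
}

/* Serial: each block's input depends on the previous block's output. */
void
toy_cbc_encrypt(const uint8_t *key, const uint8_t *iv,
                size_t length, uint8_t *dst, const uint8_t *src)
{
  const uint8_t *prev = iv;
  uint8_t tmp[TOY_BLOCK_SIZE];

  assert(length % TOY_BLOCK_SIZE == 0);
  for (size_t off = 0; off < length; off += TOY_BLOCK_SIZE)
    {
      for (size_t i = 0; i < TOY_BLOCK_SIZE; i++)
        tmp[i] = src[off + i] ^ prev[i];
      toy_encrypt_block(key, dst + off, tmp);
      prev = dst + off;  /* the chain: next block waits for this one */
    }
}

/* Independent blocks: processed back to front here to show that no
   block waits for another's result. dst must not overlap src. */
void
toy_cbc_decrypt(const uint8_t *key, const uint8_t *iv,
                size_t length, uint8_t *dst, const uint8_t *src)
{
  assert(length % TOY_BLOCK_SIZE == 0);
  for (size_t off = length; off > 0; )
    {
      off -= TOY_BLOCK_SIZE;
      const uint8_t *prev = off ? src + off - TOY_BLOCK_SIZE : iv;
      toy_decrypt_block(key, dst + off, src + off);
      for (size_t i = 0; i < TOY_BLOCK_SIZE; i++)
        dst[off + i] ^= prev[i];
    }
}
```

Because the decrypt iterations are independent, an implementation is free to run several of them in parallel; the encrypt loop offers no such freedom.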

> Before getting too far, it may also be worthwhile to try out an assembly
> memxor.
>

memxor performs the same in C and in assembly, since the s390 architecture
offers a memory xor instruction, "xc"; see the xor_len macro in machine.m4 of
the original patch for an implementation example.
However, the s390x AES accelerators offer a considerable speedup over the C
implementation with optimized internal AES. The following table shows the
difference:

Column 1: s390x accelerator
Column 2: C implementation with optimized internal AES
          (only aes128.asm, aes192.asm, aes256.asm enabled)

Function            Column 1      Column 2
--------------------------------------------------
CBC AES128 Encrypt  1.073569 cpb  13.674891 cpb
CBC AES128 Decrypt  0.647008 cpb  3.131405 cpb
CBC AES192 Encrypt  1.266316 cpb  13.183552 cpb
CBC AES192 Decrypt  0.622058 cpb  3.074917 cpb
CBC AES256 Encrypt  1.450422 cpb  14.380789 cpb
CBC AES256 Decrypt  0.648403 cpb  3.040746 cpb
CFB AES128 Encrypt  1.199716 cpb  15.116906 cpb
CFB AES128 Decrypt  1.205567 cpb  3.144538 cpb
CFB AES192 Encrypt  1.393276 cpb  15.340453 cpb
CFB AES192 Decrypt  1.415399 cpb  3.064844 cpb
CFB AES256 Encrypt  1.687762 cpb  15.876734 cpb
CFB AES256 Decrypt  1.677147 cpb  3.065851 cpb
CFB8 AES128 Encrypt 17.278379 cpb 178.117195 cpb
CFB8 AES128 Decrypt 17.327002 cpb 183.136198 cpb
CFB8 AES192 Encrypt 20.408311 cpb 184.028411 cpb
CFB8 AES192 Decrypt 20.397928 cpb 187.534654 cpb
CFB8 AES256 Encrypt 23.549944 cpb 184.800598 cpb
CFB8 AES256 Decrypt 23.367348 cpb 190.355030 cpb
CMAC AES128 Update  1.026380 cpb  12.108085 cpb
CMAC AES256 Update  1.399747 cpb  11.497727 cpb
CCM AES128 Encrypt  1.828593 cpb  15.332434 cpb
CCM AES128 Decrypt  1.691520 cpb  14.115167 cpb
CCM AES128 Update   1.027736 cpb  10.918015 cpb
CCM AES192 Encrypt  1.883996 cpb  15.840703 cpb
CCM AES192 Decrypt  1.950362 cpb  14.478925 cpb
CCM AES192 Update   1.213858 cpb  11.239195 cpb
CCM AES256 Encrypt  2.206957 cpb  15.861586 cpb
CCM AES256 Decrypt  2.311447 cpb  15.051353 cpb
CCM AES256 Update   1.404938 cpb  11.441472 cpb
CTR AES128 Crypt    0.710237 cpb  4.767290 cpb
CTR AES192 Crypt    0.635386 cpb  3.489661 cpb
CTR AES256 Crypt    0.628296 cpb  3.138727 cpb
XTS AES128 Encrypt  0.655454 cpb  15.757406 cpb
XTS AES128 Decrypt  0.656113 cpb  15.920863 cpb
XTS AES256 Encrypt  0.663048 cpb  16.689253 cpb
XTS AES256 Decrypt  0.676298 cpb  16.670889 cpb
GCM AES128 Encrypt  0.630504 cpb  15.473187 cpb
GCM AES128 Decrypt  0.627714 cpb  15.529209 cpb
GCM AES128 Update   0.514662 cpb  11.608726 cpb
GCM AES192 Encrypt  0.642785 cpb  15.245804 cpb
GCM AES192 Decrypt  0.631627 cpb  15.511039 cpb
GCM AES192 Update   0.499630 cpb  11.745876 cpb
GCM AES256 Encrypt  0.631046 cpb  15.400776 cpb
GCM AES256 Decrypt  0.622329 cpb  15.419954 cpb
GCM AES256 Update   0.499630 cpb  11.569789 cpb
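
To turn the table into speedup factors, a small helper like the following can be used. It is only a sketch: the struct and function names are made up, and the four rows are copied from the table above as a sample. For example, CBC AES128 encrypt comes out at roughly 12.7x, while CBC AES128 decrypt comes out at roughly 4.8x.

```c
#include <assert.h>

/* One table row: cycles per byte with the accelerator and with the
   C implementation on top of optimized internal AES. */
struct toy_bench_row
{
  const char *name;
  double accel_cpb;
  double c_cpb;
};

/* A few rows copied verbatim from the table. */
const struct toy_bench_row toy_rows[] = {
  { "CBC AES128 Encrypt", 1.073569, 13.674891 },
  { "CBC AES128 Decrypt", 0.647008,  3.131405 },
  { "CTR AES128 Crypt",   0.710237,  4.767290 },
  { "GCM AES128 Encrypt", 0.630504, 15.473187 },
};

/* Speedup factor: how many times fewer cycles per byte the
   accelerator needs compared to the C implementation. */
double
toy_speedup(const struct toy_bench_row *row)
{
  assert(row->accel_cpb > 0.0);
  return row->c_cpb / row->accel_cpb;
}
```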

Also, the optimized AES cores for s390x could serve as a good reference for
other crypto libraries, since they have a clean and well-documented assembly
implementation.
The only drawback I can see is the proliferation of preprocessor conditions in
the C files of the AES modes, needed to support fat builds for those
accelerators. IMO that is worth it considering the speed gain we get.
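
As an illustration of the kind of preprocessor condition I mean, here is a rough sketch. The macro TOY_HAVE_NATIVE_CBC_AES128 and every function name are hypothetical and are not taken from the patch; the generic function is a stand-in that merely copies data. The idea is that the C file of a mode keeps its generic code, and only when a native version was built does the fat setup get the chance to select it at run time.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical configure-time macro: 1 if an assembly implementation
   of this mode was built, 0 otherwise. */
#ifndef TOY_HAVE_NATIVE_CBC_AES128
# define TOY_HAVE_NATIVE_CBC_AES128 0
#endif

typedef void toy_cbc_func(size_t length, uint8_t *dst, const uint8_t *src);

/* Generic C version; a placeholder that only copies the data. */
static void
toy_cbc_generic(size_t length, uint8_t *dst, const uint8_t *src)
{
  memmove(dst, src, length);
}

#if TOY_HAVE_NATIVE_CBC_AES128
/* Would be provided by an assembly file; only declared here. */
void toy_cbc_native(size_t length, uint8_t *dst, const uint8_t *src);
#endif

static toy_cbc_func *toy_cbc_impl = toy_cbc_generic;
static const char *toy_cbc_impl_name = "generic";

/* Fat-build style setup: choose the implementation once, based on
   what the CPU supports (passed in as a flag in this sketch). */
void
toy_fat_init(int cpu_has_accel)
{
#if TOY_HAVE_NATIVE_CBC_AES128
  if (cpu_has_accel)
    {
      toy_cbc_impl = toy_cbc_native;
      toy_cbc_impl_name = "native";
    }
#else
  (void) cpu_has_accel;  /* nothing to select without a native build */
#endif
}

/* Public entry point: always goes through the selected pointer. */
void
toy_cbc_aes128_encrypt(size_t length, uint8_t *dst, const uint8_t *src)
{
  assert(toy_cbc_impl != NULL);
  toy_cbc_impl(length, dst, src);
}

const char *
toy_cbc_selected(void)
{
  return toy_cbc_impl_name;
}
```

Every accelerated mode would need a block like this, which is where the clutter comes from.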

regards,
Mamone
_______________________________________________
nettle-bugs mailing list
[email protected]
http://lists.lysator.liu.se/mailman/listinfo/nettle-bugs
