Hi,

> 1) Firstly, I was surprised about current AES implementation timings. I
> wrote the simplest test, containing calls to 4 core AES functions (
> AES_set_encrypt_key, AES_encrypt(), AES_set_decrypt_key(), AES_decrypt)
> and measured the clocks of some OpenSSL implementations.

This has been discussed a couple of times already. In the assembly
module the above subroutines use compact 256-byte S-boxes, which means
that extra calculations are needed, and *more* of them for AES_decrypt
and AES_set_decrypt_key. Hence the results. It's a conscious choice for
the assembly module, made in order to mitigate side-channel attacks.
There are alternatives for x86[_64] platforms [besides AES-NI], namely
the SSSE3 vpaes-x86[_64] and bsaes-x86_64 modules. These are accessible
through EVP and provide adequate performance (in comparison to
aes_core.c that is, not AES-NI).

> 2) So I decided to improve AES_set_decrypt_key() and here is the patch.
> The result of patched code compiled with icl is 732 cycles for 256-bit
> key that is 15% faster that best current implementation. For gcc, result
> is worse - 1086 cycles (8% speed-up). I also wrote an asm implementation
> that is faster than icl for 128-bit keys only (patched C version does
> 550 cycles and an asm version does 475) and mush faster than gcc code
> for any key length. Unfortunately, I was not able to make a patch for
> aes-586.pl because my assembler version actively uses all 7 registers
> and there is no more register to address constant tables Te1..4, Td1..4
> etc. I address them like
> mov ebp, DWORD PTR _Td2[ebp*4]
> but in your code these tables are in the code segment and require one
> more register.

It's not the fact that the tables reside in the code segment that
requires an extra register, but the fact that the code itself is
*position independent*. I mean you can 'mov ebp,_Td2[ebp*]' even if _Td2
is in the code segment, but placing it in the code segment won't make
the code position independent. And vice versa, I'd need an extra
register even if the tables were residing elsewhere. Position
independence is a strong requirement, and code that doesn't meet it is
not really interesting.

On a side note, I had a quick look and it seems to be possible to
improve ILP a little bit and improve decrypt performance by a few
percent (hopefully)...
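
To make the compact S-box point from earlier in this message a bit more
concrete, here is a minimal sketch in plain C. It is not the code from
the assembly module, just an illustration under stated assumptions: all
function names (build_sboxes, te0_entry, td0_entry, self_check) are made
up for this example, and the S-boxes are generated at start-up only to
keep the listing short. It shows that a word of the large Te0/Td0 tables
can be recomputed from a single S-box byte, and that the decrypt
direction needs the multipliers 9, 11, 13 and 14 rather than 2 and 3,
which is where the extra work comes from.

```c
#include <assert.h>
#include <stdint.h>

/* Multiply by 2 in GF(2^8) modulo the AES polynomial x^8+x^4+x^3+x+1. */
static uint8_t xtime(uint8_t a)
{
    return (uint8_t)((a << 1) ^ ((a & 0x80) ? 0x1b : 0x00));
}

/* Generic GF(2^8) multiplication by shift-and-add. */
static uint8_t gf_mul(uint8_t a, uint8_t b)
{
    uint8_t r = 0;
    while (b) {
        if (b & 1)
            r ^= a;
        a = xtime(a);
        b >>= 1;
    }
    return r;
}

static uint8_t rotl8(uint8_t v, int n)
{
    return (uint8_t)((v << n) | (v >> (8 - n)));
}

/* The two compact 256-byte tables (hypothetical names). */
static uint8_t sbox[256], inv_sbox[256];

/* S(x) = affine(x^-1); x^254 is the inverse in GF(2^8), and 0 maps to 0. */
static void build_sboxes(void)
{
    for (int x = 0; x < 256; x++) {
        uint8_t inv = 1;
        for (int i = 0; i < 254; i++)
            inv = gf_mul(inv, (uint8_t)x);
        uint8_t s = (uint8_t)(inv ^ rotl8(inv, 1) ^ rotl8(inv, 2) ^
                              rotl8(inv, 3) ^ rotl8(inv, 4) ^ 0x63);
        sbox[x] = s;
        inv_sbox[s] = (uint8_t)x;
    }
}

/* Encrypt direction: one S-box lookup plus multiplications by 2 and 3. */
static uint32_t te0_entry(uint8_t x)
{
    uint8_t s = sbox[x], s2 = xtime(s);
    return ((uint32_t)s2 << 24) | ((uint32_t)s << 16) |
           ((uint32_t)s << 8) | (uint32_t)(s2 ^ s);
}

/* Decrypt direction: multiplications by 14, 9, 13 and 11, i.e. more work. */
static uint32_t td0_entry(uint8_t x)
{
    uint8_t s = inv_sbox[x];
    return ((uint32_t)gf_mul(s, 14) << 24) | ((uint32_t)gf_mul(s, 9) << 16) |
           ((uint32_t)gf_mul(s, 13) << 8) | (uint32_t)gf_mul(s, 11);
}

/* Compare against the first entries of Te0 and Td0 in aes_core.c. */
static void self_check(void)
{
    build_sboxes();
    assert(te0_entry(0x00) == 0xc66363a5u);
    assert(td0_entry(0x00) == 0x51f4a750u);
}
```

Calling self_check() once from a test program is enough to confirm that
the recomputed words match the table-driven values. A real compact
implementation would of course not use a generic gf_mul() per byte; the
sketch only shows where the additional arithmetic originates.
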
I'll save the ILP exercise for a rainy day... Use alternatives!!! (Well,
it should probably be noted that bsaes relies on
AES_set_[en|de]crypt_key.)
______________________________________________________________________
OpenSSL Project                                 http://www.openssl.org
Development Mailing List                       openssl-dev@openssl.org
Automated List Manager                           majord...@openssl.org