Hi,

> This patch is a contribution to OpenSSL.
> It demonstrates an efficient implementation of AES-XTS, using Intel's
> AES-NI and AVX architecture.
> The performance improvement provided here is achieved via: a slightly
> improved reduction technique,
For reference: the criterion for instruction choice in the current version
was that it should be possible to issue more than one of the instructions in
question at a time. Instructions with a throughput of 1 are scheduled to a
specific port, and the rationale was to stick to port-agnostic instructions
in order to rule out contention on the port to which the aesenc instructions
are scheduled. This is why psrad was not used: its throughput is 1. Of
course, it might turn out that psrad can't clash with aesenc because it's
specific to *another* port, and I was worried for nothing. Can you (or
anybody else) confirm that this is the case? See further below...

> encryption of several (here, 8) blocks
> in parallel, use of AVX, and unrolling. These ingredients can be used
> in different ways to improve the current OpenSSL performance of XTS.
>
> The performance:
> ===============
>
> AES-XTS performance:
> ===================
> The performance was measured by using openssl speed utility as follows:
> openssl speed -evp aes-128-xts
>
> Single thread performance in 1000s of B/S, for 8KB buffer:
>
> Core i7-2600K @3.4GHz *:
>
> OpenSSL Git[1]: 3448851.18 (0.99 Cycles/Byte)
> This patch:     4060259.32 (0.84 Cycles/Byte)
> Speedup:        1.18X
>
> Core i7-3770 @3.4GHz **:
>
> OpenSSL Git[1]: 3486178.83 (0.98 Cycles/Byte)
> This patch:     4072520.76 (0.83 Cycles/Byte)
> Speedup:        1.17X
>
> *Codename "Sandy Bridge"
> **Codename "Ivy Bridge"
>
> [1] OpenSSL Gitweb: http://git.openssl.org/gitweb

Cool stuff! As mentioned earlier, what I'd also/still love to figure out is
*why* performance is limited on Sandy/Ivy Bridge. Even here I don't find
0.84 a satisfactory result, as it is still more than 25 cycles per loop
revolution behind the theoretical limit. I've been playing with a more
aggressive approach with regard to xors and achieved 0.90/0.89 on Sandy/Ivy
Bridge with a 6x interleave factor and without AVX.
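For context, the "reduction technique" quoted above concerns the per-block
tweak update in XTS: multiplying the 128-bit tweak by x in GF(2^128), where a
carry out of the top bit is folded back in with the reduction polynomial
x^128 + x^7 + x^2 + x + 1. A minimal reference sketch in Python (plain
integer arithmetic; byte-order handling and the patch's actual SIMD
formulation are omitted, and the names are mine, not from the patch):

```python
XTS_POLY = 0x87  # low bits of x^128 + x^7 + x^2 + x + 1

def xts_mul_x(tweak: int) -> int:
    """Multiply a 128-bit tweak by x in GF(2^128) (one 'doubling')."""
    carry = tweak >> 127                   # bit that falls off the top
    tweak = (tweak << 1) & ((1 << 128) - 1)
    if carry:
        tweak ^= XTS_POLY                  # reduction: fold the carry back in
    return tweak
```

Each consecutive 16-byte block uses the previous block's tweak doubled once,
which is why consecutive tweaks can be computed (and interleaved) cheaply.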
While it might appear modest, it should be noted that this result means I'm
missing as little as 6 cycles per loop revolution. The referred code is not
committed yet (I've been waiting for results for Westmere, which turned out
to be at the theoretical limit :-), but the idea is as follows. Pre-calculate
the tweak values pre-xored with the zero round key, so that we perform only
one xor at the top of the loop. Then, when we offload the tweak values to
the stack, we pre-xor them with a pre-calculated round[0]^round[last], so
that the tweak can be consumed directly at aesenclast time. In other words,
we move *two* xors inside the aesenc loop, not one as in the suggested code.
The result is (as already mentioned) 6 missing cycles on Sandy/Ivy Bridge
and the theoretical limit on Westmere and Bulldozer. I suppose a combination
of the approaches would allow us to improve performance even further.

I've actually attempted an 8x interleave without AVX earlier, but it
performed sub-optimally on Intel CPUs (though it achieved respectable
heights on Bulldozer). In other words, work is ongoing...

On an additional note: one can argue that the XTS result for the largest
buffer size is a bit mis-representative, because in real life the block size
is likely to be smaller, and at smaller block sizes tweak generation takes a
more and more noticeable portion of the time. If only there were several
blocks to process, so that it would be possible to parallelize tweak
generation... But what if the subroutine accepted an additional argument,
the storage block size, so that instead of processing 512-byte blocks one
after another, the application could say "process 4KB of consecutive
512-byte blocks"?

______________________________________________________________________
OpenSSL Project                                 http://www.openssl.org
Development Mailing List                       [email protected]
Automated List Manager                           [email protected]
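The two-xor folding described in this message rests on a simple algebraic
point: AES begins by xoring the state with round[0] and aesenclast ends by
xoring round[last], while XTS computes E_K(P ^ T) ^ T. So a tweak stored as
T^round[0] merges the input tweak xor with the initial AddRoundKey, and
xoring in the per-key constant round[0]^round[last] turns it into
T^round[last], which aesenclast can consume directly. A sketch of that
arithmetic (illustrative names, not from the code under discussion):

```python
def fold_tweak(tweak: int, round0: int, round_last: int):
    """Pre-xor a tweak as described: once with round[0] for the top of the
    loop, then with the pre-calculated round[0]^round[last] when offloading
    to the stack, yielding a value aesenclast can consume directly."""
    t_top = tweak ^ round0          # merged with the plaintext's first xor
    delta = round0 ^ round_last     # pre-calculated once per key schedule
    t_stack = t_top ^ delta         # value offloaded to the stack
    return t_top, t_stack

# The offloaded value equals tweak ^ round[last]: the final tweak xor and
# the last AddRoundKey collapse into the single xor done by aesenclast.
t_top, t_stack = fold_tweak(0x1234, 0xAB, 0xCD)
assert t_stack == 0x1234 ^ 0xCD
```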
