Hi,

> This patch is a contribution to OpenSSL.
> It demonstrates an efficient implementation of AES-XTS, using Intel's
> AES-NI and AVX architecture.
> The performance improvement provided here is achieved via: a slightly
> improved reduction technique,
For reference: the criterion for instruction choice in the current version
was that it should be possible to issue more than one of the instructions in
question at a time. Instructions with a throughput of 1 are scheduled to a
specific port, and the rationale was to stick to port-agnostic instructions
in order to rule out contention on the port to which the aesenc instructions
are scheduled. This is why psrad was not used: its throughput is 1. Of
course, it might turn out that psrad can't clash with aesenc because it's
specific to *another* port, and I was worried for nothing. Can you (or
anybody else) confirm that this is the case? See further below...

> encryption of several (here, 8) blocks
> in parallel, use of AVX, and unrolling. These ingredients can be used
> in different ways to improve the current OpenSSL performance of XTS.
>
> The performance:
> ===============
>
> AES-XTS performance:
> ===================
> The performance was measured by using openssl speed utility as follows:
> openssl speed -evp aes-128-xts
>
> Single thread performance in 1000s of B/S, for 8KB buffer:
>
> Core i7-2600K @3.4GHz *:
>
> OpenSSL Git[1]: 3448851.18 (0.99 Cycles/Byte)
> This patch:     4060259.32 (0.84 Cycles/Byte)
> Speedup:        1.18X
>
> Core i7-3770 @3.4GHz **:
>
> OpenSSL Git[1]: 3486178.83 (0.98 Cycles/Byte)
> This patch:     4072520.76 (0.83 Cycles/Byte)
> Speedup:        1.17X
>
> *Codename "Sandy Bridge"
> **Codename "Ivy Bridge"
>
> [1] OpenSSL Gitweb: http://git.openssl.org/gitweb

Cool stuff! As mentioned earlier, what I'd also/still love to figure out is
*why* performance is limited on Sandy/Ivy Bridge. Even here I don't find
0.84 a satisfactory result, as it is still more than 25 cycles per loop
revolution behind the theoretical limit. I've been playing with a more
aggressive approach with regard to xors and achieved 0.90/0.89 on Sandy/Ivy
Bridge with a 6x interleave factor and without AVX.
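For context, the "reduction technique" quoted above concerns the per-block
tweak update in XTS: multiplying the 128-bit tweak by x in GF(2^128), where a
carry out of the top bit is folded back in with the reduction polynomial
x^128 + x^7 + x^2 + x + 1. A minimal reference sketch in Python (plain
integer arithmetic; byte-order handling and the patch's actual SIMD
formulation are omitted, and the names are mine, not from the patch):

```python
XTS_POLY = 0x87  # low bits of x^128 + x^7 + x^2 + x + 1

def xts_mul_x(tweak: int) -> int:
    """Multiply a 128-bit tweak by x in GF(2^128) (one 'doubling')."""
    carry = tweak >> 127                   # bit that falls off the top
    tweak = (tweak << 1) & ((1 << 128) - 1)
    if carry:
        tweak ^= XTS_POLY                  # reduction: fold the carry back in
    return tweak
```

Each consecutive 16-byte block uses the previous block's tweak doubled once,
which is why consecutive tweaks can be computed (and interleaved) cheaply.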
While it might appear modest, it should be noted that this result means I'm
missing as little as 6 cycles per loop revolution. The referred code is not
committed yet (I've been waiting for results for Westmere, which turned out
to be at the theoretical limit :-), but the idea is as follows. Pre-calculate
the tweak values pre-xored with the zero round key, so that we perform only
one xor at the top of the loop. Then, when we offload the tweak values to
the stack, we pre-xor them with a pre-calculated round[0]^round[last], so
that the tweak can be consumed directly at aesenclast time. In other words,
we move *two* xors inside the aesenc loop, not one as in the suggested code.
The result is (as already mentioned) 6 missing cycles on Sandy/Ivy Bridge
and the theoretical limit on Westmere and Bulldozer. I suppose a combination
of the approaches would allow us to improve performance even further.

I've actually attempted an 8x interleave without AVX earlier, but it
performed sub-optimally on Intel CPUs (though it achieved respectable
heights on Bulldozer). In other words, work is ongoing...

On an additional note: one can argue that the XTS result for the largest
buffer size is a bit mis-representative, because in real life the block size
is likely to be smaller, and at smaller block sizes tweak generation takes a
more and more noticeable portion of the time. If only there were several
blocks to process, so that it would be possible to parallelize tweak
generation... But what if the subroutine accepted an additional argument,
the storage block size, so that instead of processing 512-byte blocks one
after another, the application could say "process 4KB of consecutive
512-byte blocks"?

______________________________________________________________________
OpenSSL Project                                 http://www.openssl.org
Development Mailing List                       [email protected]
Automated List Manager                           [email protected]
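The two-xor folding described in this message rests on a simple algebraic
point: AES begins by xoring the state with round[0] and aesenclast ends by
xoring round[last], while XTS computes E_K(P ^ T) ^ T. So a tweak stored as
T^round[0] merges the input tweak xor with the initial AddRoundKey, and
xoring in the per-key constant round[0]^round[last] turns it into
T^round[last], which aesenclast can consume directly. A sketch of that
arithmetic (illustrative names, not from the code under discussion):

```python
def fold_tweak(tweak: int, round0: int, round_last: int):
    """Pre-xor a tweak as described: once with round[0] for the top of the
    loop, then with the pre-calculated round[0]^round[last] when offloading
    to the stack, yielding a value aesenclast can consume directly."""
    t_top = tweak ^ round0          # merged with the plaintext's first xor
    delta = round0 ^ round_last     # pre-calculated once per key schedule
    t_stack = t_top ^ delta         # value offloaded to the stack
    return t_top, t_stack

# The offloaded value equals tweak ^ round[last]: the final tweak xor and
# the last AddRoundKey collapse into the single xor done by aesenclast.
t_top, t_stack = fold_tweak(0x1234, 0xAB, 0xCD)
assert t_stack == 0x1234 ^ 0xCD
```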
