> This patch is a contribution to OpenSSL. It offers an efficient
> implementation of AES-CTR, using Intel's AES-NI and AVX architecture.
Thanks.

> This contribution also improves the performance of AES-GCM. While
> faster AES-GCM can be achieved by interleaving the CTR and GHASH, we
> understand from [1] and [2] that the OpenSSL team prefers to
> implement the encryption and the authentication serially (and
> separately). With this as the preferred direction, a faster CTR mode
> implementation would also improve AES-GCM.

It's not a matter of preference, but of common sense. Indeed, there is little point in implementing [and maintaining] assembly code *if* it doesn't provide a substantial improvement. And the thing about interleaved GCM is that no *published* result has been observed to deliver substantially better performance than the sum of its components. Even this suggestion underpins the point, because in the GCM context it surpasses your own earlier suggestion in ticket #2900 [to which there *will* be a separate response].

> The performance improvement provided in this patch is achieved by
> observing that with a given IV, 96 bits of consecutive counter blocks
> are constant. Counter blocks are incremented only on their remaining
> 32 bits, and this can be carried out with ALU instructions. In
> addition, we note that the 96 bits are also constant after the
> initial xor, and can therefore be pre-calculated. This way, each
> counter requires only a 32-bit xor (also done with ALU instructions).

Previous versions already exploit the fact that 96 bits are constant. The conclusion I draw is that the gain comes rather from the aesenc loop getting fully unrolled, with the counter calculations [and the xor with the 0-round key] effectively modulo-scheduled.

It should be noted that even this improved result is rather far from the theoretical limit. Sandy/Ivy Bridge seem to suffer from some kind of anomaly when they hit a mix of loads, aesenc and branches. I call it an anomaly because other CPUs, Westmere, Bulldozer, ..., have less of a problem approaching the theoretical limit determined by the performance characteristics of the instructions in play.
> AES-CTR performance:
> ====================
> The performance was measured by using the openssl speed utility as
> follows:
>
>     openssl speed -evp aes-128-ctr
>
> Single-thread performance in 1000s of B/s, for an 8KB buffer:
>
> Core i7-3770 @3.4GHz **:
>
> OpenSSL Git[1]: 4016931.84 (0.85 Cycles/Byte)
> This patch:     5021340.50 (0.68 Cycles/Byte)

I can't confirm these numbers. I think the results are skewed by Turbo Boost and as such are not directly representative. The Ivy Bridge numbers are very much the same as the Sandy Bridge ones.

> As a comparison baseline, we post OpenSSL's AES-ECB performance. The
> CTR mode implementation of the proposed patch is faster than the
> current OpenSSL ECB. (this is obviously less-than-optimal)

But let's keep in mind that ECB is hardly used in real applications, and its optimization has more academic value than practical. What I'd prefer to find out is an explanation for why Sandy/Ivy Bridge performs sub-optimally on parallelizable algorithms, so that we can assess [in a meaningful manner] how to improve the algorithms that actually matter.

But back to the beginning, and preferences. Well, there is a preference factor, but of the following nature. It's preferred that code bears as little architecture-specific dependency as possible. If some CPU-specific optimization is not substantially faster, then it falls into the non-preferable category. I mean, code that gives just a few percent edge on a specific CPU is not preferred. Therefore
http://git.openssl.org/gitweb/?p=openssl.git;a=commitdiff;h=6c79faaa9dd288bfda72831a9ef22ca01fa482d4.

A special note about the result of 0.77 cpb on Sandy Bridge mentioned in the commentary section: strangely enough, if I modify speed.c to run only the largest block size, I measure 0.75 on my system. Which one is real? Another mystery...

Cheers.

______________________________________________________________________
OpenSSL Project                                 http://www.openssl.org
Development Mailing List                       [email protected]
Automated List Manager                           [email protected]
