> This patch is a contribution to OpenSSL. It offers an efficient
> implementation of AES-CTR, using Intel's AES-NI and AVX architecture.
Thanks.

> This contribution also improves the performance of AES-GCM. While
> faster AES-GCM can be achieved by interleaving the CTR and GHASH, we
> understand from [1] and [2] that the OpenSSL team prefers to
> implement the encryption and the authentication serially (and
> separately). With this as the preferred direction, a faster CTR mode
> implementation would also improve AES-GCM.

It's not a matter of preference, but of common sense. Indeed, there is little point in implementing [and maintaining] assembly code *if* it doesn't provide a substantial improvement. And the thing about interleaved GCM is that no *published* result has been observed to deliver substantially better performance than the sum of its components. Even this suggestion underpins the point, because in the GCM context it surpasses your own earlier suggestion in ticket #2900 [to which there *will* be a separate response].

> The performance improvement provided in this patch is achieved by
> observing that with a given IV, 96 bits of consecutive counter blocks
> are constant. Counter blocks are incremented only on their remaining
> 32 bits, and this can be carried out with ALU instructions. In
> addition, we note that the 96 bits are also constant after the
> initial xor, and can therefore be pre-calculated. This way, each
> counter requires only a 32-bit xor (also done with ALU instructions).

Previous versions already exploit the fact that 96 bits are constant. The conclusion I draw is that the gain comes rather from the aesenc loop getting fully unrolled, with the counter calculations [and the xor with the 0-round key] effectively modulo-scheduled.

It should be noted that even this improved result is rather far from the theoretical limit. Sandy/Ivy Bridge seem to suffer from some kind of anomaly when they hit a mix of loads, aesenc and branches. I call it an anomaly because other CPUs, Westmere, Bulldozer, ..., have less of a problem approaching the theoretical limit determined by the performance characteristics of the instructions in play.
> AES-CTR performance:
> ====================
> The performance was measured by using the openssl speed utility as
> follows:
>
>     openssl speed -evp aes-128-ctr
>
> Single-thread performance in 1000s of B/s, for an 8KB buffer:
>
> Core i7-3770 @3.4GHz **:
>
> OpenSSL Git[1]: 4016931.84 (0.85 Cycles/Byte)
> This patch:     5021340.50 (0.68 Cycles/Byte)

I can't confirm these numbers. I think the results are skewed by Turbo Boost and as such are not directly representative. The Ivy Bridge numbers are very much the same as the Sandy Bridge ones.

> As a comparison baseline, we post OpenSSL's AES-ECB performance. The
> CTR mode implementation of the proposed patch is faster than the
> current OpenSSL ECB. (this is obviously less-than-optimal)

But let's keep in mind that ECB is hardly used in real applications, and its optimization has more academic value than practical. What I'd prefer to find out is an explanation for why Sandy/Ivy Bridge performs sub-optimally on parallelizable algorithms, so that we can assess [in a meaningful manner] how to improve the algorithms that actually matter.

But back to the beginning, and preferences. Well, there is a preference factor, but of the following nature. It's preferred that code bears as little architecture-specific dependency as possible. If some CPU-specific optimization is not substantially faster, then it falls into the non-preferable category. I mean, code that gives just a few percent edge on a specific CPU is not preferred. Therefore
http://git.openssl.org/gitweb/?p=openssl.git;a=commitdiff;h=6c79faaa9dd288bfda72831a9ef22ca01fa482d4.

A special note about the result of 0.77 cpb on Sandy Bridge mentioned in the commentary section: strangely enough, if I modify speed.c to run only the largest block size, I measure 0.75 on my system. Which one is real? Another mystery...

Cheers.

______________________________________________________________________
OpenSSL Project                                 http://www.openssl.org
Development Mailing List                       [email protected]
Automated List Manager                           [email protected]
